Your Migration Journey
A learning path for migrating to continuous delivery, built on years of hands-on experience helping teams remove friction and improve delivery outcomes.
This guide is a learning path built on years of helping teams across industries remove
friction, improve delivery outcomes, and raise team morale through continuous delivery.
It expands on the practices defined at MinimumCD.org and the
production-tested playbooks from the Dojo Consortium,
grounded in hands-on application of one driving question: “Why can’t I deliver today’s
work to production today?” Start with the problem your team feels most, then follow the
path to solving it.
Where to Start
- Anti-Patterns - Find the problems your team is facing and learn the concrete steps to fix each one.
- Brownfield CD - Already have a running system? A phased approach to migrating existing applications and teams to continuous delivery.
Content Sources
This guide adapts content from two CC BY 4.0 licensed sources: MinimumCD.org and the Dojo Consortium.
Each adapted page includes attribution to its source material.
1 - Quality and Delivery Anti-Patterns
Start here. Find the anti-patterns your team is facing and learn the path to solving them.
Every team migrating to continuous delivery faces obstacles. Most are not unique to your team,
your technology, or your industry. This section catalogs the anti-patterns that hurt quality,
increase rework, and make delivery timelines unpredictable - then provides a concrete path to
fix each one.
Start with the problem you feel most. Each page links to the practices and migration phases
that address it.
Anti-pattern index
Sorted by quality impact so you can prioritize what to fix first.
1.1 - Team Workflow
Anti-patterns in how teams assign, coordinate, and manage the flow of work.
These anti-patterns affect how work moves through the team. They create bottlenecks, hide
problems, and prevent the steady flow of small changes that continuous delivery requires.
1.1.1 - Pull Request Review Bottlenecks
Pull requests sit for days waiting for review. Reviews happen in large batches. Authors have moved on by the time feedback arrives.
Category: Team Workflow | Quality Impact: High
What This Looks Like
A developer opens a pull request and waits. Hours pass. A day passes. They ping someone in chat.
The reviewer is busy with their own work. Eventually, late in the afternoon or the next morning,
comments arrive. The author has moved on to something else and has to reload context to respond.
Another round of comments. Another wait. The PR finally merges two or three days after it was
opened.
Common variations:
- The aging PR queue. The team has five or more open PRs at any given time. Some are days old.
Developers start new work while they wait, which creates more PRs, which creates more review
load, which slows reviews further.
- The designated reviewer. One or two senior developers review everything. They are
overwhelmed. Their review queue is a bottleneck that the rest of the team works around by
starting more work while they wait.
- The drive-by approval. Reviews are so slow that the team starts rubber-stamping PRs to
unblock each other. The review step exists in name only. Quality drops, but at least things
merge.
- The nitpick spiral. Reviewers leave dozens of style comments on formatting, naming, and
conventions that could be caught by a linter. Each round triggers another round. A 50-line
change accumulates 30 comments across three review cycles.
- The “I’ll get to it” pattern. When asked about a pending review, the answer is always “I’ll
look at it after I finish this.” But they never finish “this” because they have their own work,
and reviewing someone else’s code is never the priority.
The telltale sign: the team tracks PR age and the average is measured in days, not hours.
Why This Is a Problem
Slow code review is not just an inconvenience. It is a systemic bottleneck that undermines
continuous integration, inflates cycle time, and degrades the quality it is supposed to protect.
It blocks continuous integration
Trunk-based development requires integrating to trunk at least once per day. A PR that sits for
two days makes daily integration impossible. The branch diverges from trunk while it waits. Other
developers make changes to the same files. By the time the review is done, the PR has merge
conflicts that require additional work to resolve.
This is a compounding problem. Slow reviews cause longer-lived branches. Longer-lived branches
cause larger merge conflicts. Larger merge conflicts make integration painful. Painful integration
makes the team dread merging, which makes them delay opening PRs until the work is “complete,”
which makes PRs larger, which makes reviews take longer.
In teams where reviews complete within hours, branches rarely live longer than a day. Merge
conflicts are rare because changes are small and trunk has not moved far since the branch was
created.
It inflates cycle time
Every hour a PR waits for review is an hour added to cycle time. For a story that takes four hours
to code, a two-day review wait means the review step dominates the total cycle time. The coding
was fast. The pipeline is fast. But the work sits idle for days because a human has not looked at
it yet.
This wait time is pure waste. Nothing is happening to the code while it waits. No value is being
delivered. The change is done but not integrated, tested in the full pipeline, or deployed. It is
inventory sitting on the shelf.
When reviews happen within two hours, the review step nearly disappears from the cycle time
measurement. Code flows from development to trunk to production with minimal idle time.
It degrades the review quality it is supposed to protect
Slow reviews produce worse reviews, not better ones. When a reviewer sits down to review a PR that
was opened two days ago, they have no context on the author’s thinking. They review the code cold,
missing the intent behind decisions. They leave comments that the author already considered and
rejected, triggering unnecessary back-and-forth.
Large PRs make this worse. When a review has been delayed, the author often keeps working on the
same branch, adding more changes to avoid opening a second PR while the first one waits. What
started as a 50-line change becomes a 300-line change. Research consistently shows that reviewer
effectiveness drops sharply after 200 lines. Large PRs get superficial reviews - the reviewer
skims the diff, leaves a few surface-level comments, and approves because they do not have time
to review it thoroughly.
Fast reviews are better reviews. A reviewer who looks at a 50-line change within an hour of it
being opened has full context on what the team is working on, can ask the author questions in real
time, and can give focused attention to a small, digestible change.
It creates hidden WIP
Every open PR is work in progress. The code is written but not integrated. The developer who
authored it has moved on to something new, but their previous work is still “in progress” from the
team’s perspective. A team of five with eight open PRs has eight items of hidden WIP that do not
appear on the sprint board as “in progress” but consume the same attention.
This hidden WIP interacts badly with explicit WIP. A developer who has one story “in progress” on
the board but three PRs waiting for review is actually juggling four streams of work. Each PR that
gets comments requires a context switch back to code they wrote days ago. The cognitive overhead is
real even if the board does not show it.
Impact on continuous delivery
Continuous delivery requires that every change move from commit to production quickly and
predictably. Review bottlenecks create an unpredictable queue between “code complete” and
“integrated.” The queue length varies based on reviewer availability, competing priorities, and
team habits. Some PRs merge in hours, others take days. This variability makes delivery timelines
unpredictable and prevents the steady flow of small changes that CD depends on.
The bottleneck also discourages the small, frequent changes that make CD safe. Developers learn
that every PR costs a multi-day wait, so they batch changes into larger PRs to reduce the number
of times they pay that cost. Larger PRs are riskier, harder to review, and more likely to cause
problems - exactly the opposite of what CD requires.
How to Fix It
Step 1: Measure review turnaround time (Week 1)
You cannot fix what you do not measure. Start tracking two numbers:
- Time to first review: elapsed time from PR opened to first reviewer comment or approval.
- PR age at merge: elapsed time from PR opened to PR merged.
Most teams discover their average is far worse than they assumed. Developers think reviews happen
in a few hours. The data shows days.
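If your repository is on GitHub, a short script can pull both numbers from the REST API. The sketch below is illustrative rather than a finished tool: the owner/repo values and token variable are placeholders, and it treats the first submitted review as the "first review" - adjust that definition if your team counts comments too.

```python
# Sketch: pull time-to-first-review and PR age at merge from the GitHub REST API.
# OWNER/REPO are placeholders; the token is read from GITHUB_TOKEN. "First review"
# here means the first submitted review - adapt if your team counts comments too.
import os
from datetime import datetime

import requests

OWNER, REPO = "your-org", "your-repo"  # placeholder values
API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}


def parse(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def review_metrics(limit: int = 50) -> None:
    prs = requests.get(
        f"{API}/repos/{OWNER}/{REPO}/pulls",
        headers=HEADERS,
        params={"state": "closed", "per_page": limit},
    ).json()

    for pr in prs:
        if not pr.get("merged_at"):
            continue  # skip PRs closed without merging
        opened, merged = parse(pr["created_at"]), parse(pr["merged_at"])

        reviews = requests.get(
            f"{API}/repos/{OWNER}/{REPO}/pulls/{pr['number']}/reviews",
            headers=HEADERS,
        ).json()
        first = min(
            (parse(r["submitted_at"]) for r in reviews if r.get("submitted_at")),
            default=None,
        )

        age_h = (merged - opened).total_seconds() / 3600
        if first is None:
            print(f"PR #{pr['number']}: no review recorded, merged after {age_h:.1f}h")
        else:
            wait_h = (first - opened).total_seconds() / 3600
            print(f"PR #{pr['number']}: first review after {wait_h:.1f}h, "
                  f"merged after {age_h:.1f}h")


if __name__ == "__main__":
    review_metrics()
```

Run it over the last month of merged PRs and compare the averages with what the team assumes - the gap is usually the conversation starter.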
Step 2: Set a team review SLA (Week 1)
Agree as a team on a review turnaround target. A reasonable starting point:
- Reviews within 2 hours during working hours.
- PR age at merge under 24 hours.
Write this down as a working agreement. Post it on the board. This is not a suggestion - it is a
team commitment.
Step 3: Make reviews a first-class activity (Week 2)
The core behavior change: reviewing code is not something you do when you have spare time. It is
the highest-priority activity after your current task reaches a natural stopping point.
Concrete practices:
- Check for open PRs before starting new work. When a developer finishes a task or hits a
natural pause, their first action is to check for pending reviews, not to pull a new story.
- Auto-assign reviewers. Do not wait for someone to volunteer. Configure your tools to
assign a reviewer automatically when a PR is opened (see the sketch after this list).
- Rotate reviewers. Do not let one or two people carry all the review load. Any team member
should be able to review any PR. This spreads knowledge and distributes the work.
- Keep PRs small. Target under 200 lines of changed code. Small PRs get reviewed faster and
more thoroughly. If a developer says their PR is “too large to split,” that is a work
decomposition problem.
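On the auto-assignment point: most platforms have this built in - GitHub, for example, can request reviews automatically through CODEOWNERS or its team review assignment settings. If you prefer to script the rotation yourself, a minimal sketch might look like the following, assuming a GitHub repository; the reviewer list, repo names, and rotation rule are placeholders.

```python
# Sketch: request a reviewer on a newly opened PR, rotating through the team.
# Assumes GitHub; OWNER/REPO, the REVIEWERS list, and the rotation rule
# (PR number modulo team size) are placeholders.
import os

import requests

OWNER, REPO = "your-org", "your-repo"          # placeholder values
REVIEWERS = ["alice", "bob", "carol", "dave"]  # hypothetical usernames
API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}


def assign_reviewer(pr_number: int, author: str) -> str:
    # Deterministic rotation: pick by PR number, never the PR's own author.
    candidates = [r for r in REVIEWERS if r != author]
    reviewer = candidates[pr_number % len(candidates)]
    resp = requests.post(
        f"{API}/repos/{OWNER}/{REPO}/pulls/{pr_number}/requested_reviewers",
        headers=HEADERS,
        json={"reviewers": [reviewer]},
    )
    resp.raise_for_status()
    return reviewer


if __name__ == "__main__":
    print(assign_reviewer(pr_number=42, author="alice"))  # example invocation
```

Trigger it from whatever fires when a PR opens - a webhook or a CI job. The point is that nobody has to remember to ask for a review.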
Step 4: Consider synchronous review (Week 3+)
The fastest review is one that happens in real time. If async review consistently exceeds the
team’s SLA, move toward synchronous alternatives:
| Method | How it works | Review wait time |
| --- | --- | --- |
| Pair programming | Two developers write the code together. Review is continuous. | Zero |
| Over-the-shoulder | Author walks reviewer through the change on a call. | Minutes |
| Rapid async | PR opened, reviewer notified, review within 2 hours. | Under 2 hours |
| Traditional async | PR opened, reviewer gets to it when they can. | Hours to days |
Pair programming eliminates the review bottleneck entirely. The code is reviewed as it is written.
There is no PR, no queue, and no wait. For teams that struggle with review latency, pairing is
often the most effective solution.
Step 5: Address the objections
| Objection | Response |
| --- | --- |
| “I can’t drop what I’m doing to review” | You are not dropping everything. You are checking for reviews at natural stopping points: after a commit, after a test passes, after a meeting. Reviews that take 10 minutes should not require “dropping” anything. |
| “Reviews take too long because the PRs are too big” | Then the PRs need to be smaller. A 50-line change takes 5-10 minutes to review. The review is not the bottleneck - the PR size is. |
| “Only senior developers can review this code” | That is a knowledge silo. Rotate reviewers so that everyone builds familiarity with every part of the codebase. Junior developers reviewing senior code is learning. Senior developers reviewing junior code is mentoring. Both are valuable. |
| “We need two reviewers for compliance” | Check whether your compliance framework actually requires two human reviewers, or whether it requires two sets of eyes on the code. Pair programming satisfies most separation-of-duties requirements while eliminating review latency. |
| “We tried faster reviews and quality dropped” | Fast does not mean careless. Automate style checks so reviewers focus on logic, correctness, and design. Small PRs get better reviews than large ones regardless of speed. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Time to first review | Should drop below 2 hours |
| PR age at merge | Should drop below 24 hours |
| Open PR count | Should stay low - ideally fewer than the number of team members |
| PR size (lines changed) | Should trend below 200 lines |
| Review rework cycles | Should stay under 2 rounds per PR |
| Development cycle time | Should decrease as review wait time drops |
1.1.2 - Work Items Too Large
Work items regularly take more than a week. Developers work on a single item for days without integrating.
Category: Team Workflow | Quality Impact: High
What This Looks Like
A developer picks up a work item on Monday. By Wednesday, they are still working on it. By Friday,
it is “almost done.” The following Monday, they are fixing edge cases. The item finally moves to
review mid-week - a 300-line pull request that the reviewer does not have time to look at
carefully.
Common variations:
- The week-long item. Work items routinely take five or more days. Developers work on a single
item for an entire sprint without integrating to trunk. The branch diverges further every day.
- The “it’s really just one thing” item. A ticket titled “Add user profile page” hides a
login form, avatar upload, email verification, notification preferences, and password reset.
It looks like one feature to the product owner. It is six features to the developer.
- The point-inflated item. The team estimates work at 8 or 13 points. Nobody questions
whether an 8-point item should be decomposed. High estimates are treated as a property of the
work rather than a signal that the work is too big.
- The “spike that became a feature.” A time-boxed investigation turns into an implementation.
The developer keeps going because they have momentum, and the result is a large, unreviewed
change that was never planned or decomposed.
- The horizontal slice. Work is split by technical layer: “build the database schema,”
“build the API,” “build the UI.” Each item takes days because it spans the entire layer.
Nothing is deployable until all three are done.
The telltale sign: look at the team’s cycle time distribution. If work items regularly take five
or more days from start to done, the items are too large.
Why This Is a Problem
Large work items are not just slow. They are a compounding force that makes every other part of
the delivery process worse.
They prevent daily integration
Trunk-based development requires integrating to trunk at least once per day. A work item that
takes a week to complete cannot be integrated daily unless it is decomposed into smaller pieces
that are each independently integrable. Most teams with large work items do not decompose them -
they work on a branch for the full duration and merge at the end.
This means a week of work is invisible to the rest of the team until it lands as a single large
merge. A week of assumptions goes untested against the real state of trunk. A week of potential
merge conflicts accumulates silently.
When work items are small enough to complete in one to two days, each item is a natural
integration point. The developer finishes the item, integrates to trunk, and the change is
tested, reviewed, and deployed before the next item begins.
They make estimation meaningless
Large work items hide unknowns. An item estimated at 8 points might take three days or three
weeks depending on what the developer discovers along the way. The estimate is a guess wrapped in
false precision.
This makes planning unreliable. The team commits to a set of large items, discovers mid-sprint
that one of them is twice as big as expected, and scrambles at the end. The retrospective
identifies “estimation accuracy” as the problem, but the real problem is that the items were too
big to estimate accurately in the first place.
Small work items are inherently more predictable. An item that takes one to two days has a narrow
range of uncertainty. Even if the estimate is off, it is off by hours, not weeks. Plans built
from small items are more reliable because the variance of each item is small.
They increase rework
A developer working on a large item makes dozens of decisions over several days: architectural
choices, naming conventions, error handling approaches, API contracts. These decisions are made in
isolation. Nobody sees them until the code review, which happens after all the work is done.
When the reviewer disagrees with a fundamental decision made on day one, the developer has built
five days of work on top of it. The rework cost is enormous. They either rewrite large portions
of the code or the team accepts a suboptimal decision because the cost of changing it is too high.
With small items, decisions surface quickly. A one-day item produces a small pull request that is
reviewed within hours. If the reviewer disagrees with an approach, the cost of changing it is a
few hours of work, not a week. Fundamental design problems are caught early, before layers of
code are built on top of them.
They hide risk until the end
A large work item carries risk that is invisible until late in its lifecycle. The developer might
discover on day four that the chosen approach does not work, that an API they depend on behaves
differently than documented, or that the database cannot handle the query pattern they assumed.
When this discovery happens on day four of a five-day item, the options are bad: rush a fix, cut
scope, or miss the sprint commitment. The team had no visibility into the risk because the work
was a single opaque block on the board.
Small items surface risk early. If the approach does not work, the team discovers it on day one
of a one-day item. The cost of changing direction is minimal. The risk is contained to a small
unit of work rather than spreading across an entire feature.
Impact on continuous delivery
Continuous delivery is built on small, frequent, low-risk changes flowing through the pipeline.
Large work items produce the opposite: infrequent, high-risk changes that batch up in branches
and land as large merges.
A team with five developers working on five large items has zero deployable changes for days at a
time. Then several large changes land at once, the pipeline is busy for hours, and conflicts
between the changes create unexpected failures. This is batch-and-queue delivery wearing agile
clothing.
The feedback loop is broken too. A small change deployed to production gives immediate signal:
does the change work? Does it affect performance? Do users behave as expected? A large change
deployed after a week gives noisy signal: something changed, but which of the fifty modifications
caused the issue?
How to Fix It
Step 1: Establish the 2-day rule (Week 1)
Agree as a team: no work item should take longer than two days from start to integrated on
trunk.
This is not a velocity target. It is a constraint on item size. If an item cannot be completed
in two days, it must be decomposed before it is pulled into the sprint.
Write this as a working agreement and enforce it during planning. When someone estimates an item
at more than two days, the response is “how do we split this?” - not “who can do it faster?”
Step 2: Learn vertical slicing (Week 2)
The most common decomposition mistake is horizontal slicing - splitting by technical layer instead
of by user-visible behavior. Train the team on vertical slicing:
Horizontal (avoid):
| Work item | Deployable? | Testable end-to-end? |
| --- | --- | --- |
| Build the database schema for orders | No | No |
| Build the API for orders | No | No |
| Build the UI for orders | Only after all three are done | Only after all three are done |
Vertical (prefer):
| Work item | Deployable? | Testable end-to-end? |
| --- | --- | --- |
| User can create a basic order (DB + API + UI) | Yes | Yes |
| User can add a discount to an order | Yes | Yes |
| User can view order history | Yes | Yes |
Each vertical slice cuts through all layers to deliver a thin piece of complete functionality.
Each is independently deployable and testable. Each gives feedback before the next slice begins.
Step 3: Use acceptance criteria as a splitting signal (Week 2+)
Count the acceptance criteria on each work item. If an item has more than three to five acceptance
criteria, it is probably too big. Each criterion or small group of criteria can become its own
item.
Write acceptance criteria in concrete Given-When-Then format. Each scenario is a natural
decomposition boundary:
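For example, a hypothetical “apply a discount code” item might carry scenarios like these:
- Given an order over $100, when the customer applies the code SAVE10, then the order total is reduced by 10%
- Given an order under $100, when the customer applies the code SAVE10, then the discount is rejected with an “order does not qualify” message
- Given an expired discount code, when the customer applies it, then the order total is unchanged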
Each scenario can be implemented, integrated, and deployed independently.
Step 4: Decompose during refinement, not during the sprint (Week 3+)
Work items should arrive at planning already decomposed. If the team is splitting items
mid-sprint, refinement is not doing its job.
During backlog refinement:
- Product owner presents the feature or outcome.
- Team discusses the scope and writes acceptance criteria.
- If the item has more than three to five criteria, split it immediately.
- Each resulting item is estimated. Any item over two days is split again.
- Items enter the sprint already small enough to flow.
Step 5: Address the objections
| Objection | Response |
| --- | --- |
| “Splitting creates too many items to manage” | Small items are easier to manage, not harder. They have clear scope, predictable timelines, and simple reviews. The overhead per item should be near zero. If it is not, simplify your process. |
| “Some things can’t be done in two days” | Almost anything can be decomposed further. Database migrations can be done in backward-compatible steps. UI changes can be hidden behind feature flags. The skill is finding the decomposition, not deciding whether one exists. |
| “We’ll lose the big picture if we split too much” | The epic or feature still exists as an organizing concept. Small items are not independent fragments - they are ordered steps toward a defined outcome. Use an epic to track the overall feature and individual items to track the increments. |
| “Product doesn’t want partial features” | Feature flags let you deploy incomplete features without exposing them to users. The code is integrated and tested continuously, but the user-facing feature is toggled on only when all slices are done. |
| “Our estimates are fine, items just take longer than expected” | That is the definition of items being too big. Small items have narrow estimation variance. If a one-day item takes two days, you are off by a day. If a five-day item takes ten, you have lost a sprint. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Item cycle time | Should be two days or less from start to trunk |
| Development cycle time | Should decrease as items get smaller |
| Items completed per week | Should increase even if total output stays the same |
| Integration frequency | Should increase as developers integrate completed items daily |
| Items that exceed the 2-day rule | Track violations and discuss in retrospectives |
| Work in progress | Should decrease as smaller items flow through faster |
1.1.3 - No Vertical Slicing
Work is organized by technical layer - “build the API,” “build the UI” - rather than by user-visible behavior. Nothing is deployable until all layers are done.
Category: Team Workflow | Quality Impact: Medium
What This Looks Like
The team breaks a feature into work items by architectural layer. One item for the database
schema. One for the API. One for the frontend. Maybe one for “integration testing” at the end.
Each item lives in a different lane or is assigned to a different specialist. Nothing reaches
production until the last layer is finished and all the pieces are stitched together.
Common variations:
- Layer-based assignment. “The backend team builds the API, the frontend team builds the UI.”
Each team delivers their layer independently. Integration is a separate phase that happens after
both teams are “done.”
- The database-first approach. Every feature starts with “build the schema.” Weeks of database
work happen before any API or UI exists. The schema is designed for the complete feature rather
than for the first thin slice.
- The API-then-UI pattern. The API is built and “tested” in isolation with Postman or curl.
The UI is built weeks later against the API. Mismatches between what the API provides and what
the UI needs are discovered at the end.
- The “integration sprint.” After the layers are built separately, the team dedicates a sprint
to wiring everything together. This sprint always takes longer than planned because the layers
were built on different assumptions.
- Technical stories on the board. The backlog contains items like “create database indexes,”
“add caching layer,” or “refactor service class.” None of these deliver user-visible value. They
are infrastructure work that has been separated from the feature it supports.
The telltale sign: ask “can we deploy this work item to production and have a user see something
different?” If the answer is no, the work is sliced horizontally.
Why This Is a Problem
Horizontal slicing feels natural to developers because it matches how they think about the
system’s architecture. But it optimizes for how the code is organized, not for how value is
delivered. The consequences compound across every dimension of delivery.
Nothing is deployable until everything is done
A horizontal slice delivers no user-visible value on its own. The database schema alone does
nothing. The API alone does nothing a user can see. The UI alone has no data to display. Value
only emerges when all layers are assembled - and that assembly happens at the end.
This means the team has zero deployable output for the entire duration of the feature build. A
feature that takes three sprints to build across layers produces three sprints of work in progress
and zero deliverables. The team is busy the entire time, but nothing reaches production.
With vertical slicing, every item is deployable. The first slice might be “user can create a
basic order” - thin, but it touches the database, API, and UI. It can be deployed to production
behind a feature flag on day two. Feedback starts immediately. The remaining slices build on a
working foundation rather than converging on an untested one.
Integration risk accumulates invisibly
When layers are built separately, each team or developer makes assumptions about how their layer
will connect to the others. The backend developer assumes the API contract looks a certain way.
The frontend developer assumes the response format matches their component design. The database
developer assumes the query patterns align with how the API will call them.
These assumptions are untested until integration. The longer the layers are built in isolation,
the more assumptions accumulate and the more likely they are to conflict. Integration becomes the
riskiest phase of the project - the phase where all the hidden mismatches surface at once.
With vertical slicing, integration happens with every item. The first slice forces the developer
to connect all the layers immediately. Assumptions are tested on day one, not month three.
Subsequent slices extend a working, integrated system rather than building isolated components
that have never talked to each other.
Feedback is delayed until it is expensive to act on
A horizontal approach delays user feedback until the full feature is assembled. If the team builds
the wrong thing - misunderstands a requirement, makes a poor UX decision, or solves the wrong
problem - they discover it after weeks of work across multiple layers.
At that point, the cost of changing direction is enormous. The database schema, API contracts, and
UI components all need to be reworked. The team has already invested heavily in an approach that
turns out to be wrong.
Vertical slicing delivers feedback with every increment. The first slice ships a thin version of
the feature that real users can see. If the approach is wrong, the team discovers it after a day
or two of work, not after a month. The cost of changing direction is the cost of one small item,
not the cost of an entire feature.
It creates specialist dependencies and handoff delays
Horizontal slicing naturally leads to specialist assignment: the database expert takes the
database item, the API expert takes the API item, the frontend expert takes the frontend item.
Each person works in isolation on their layer, and the work items have dependencies between them -
the API cannot be built until the schema exists, the UI cannot be built until the API exists.
These dependencies create sequential handoffs. The database work finishes, but the API developer
is busy with something else. The API work finishes, but the frontend developer is mid-sprint on
a different feature. Each handoff introduces wait time that has nothing to do with the complexity
of the work.
Vertical slicing eliminates these dependencies. A single developer (or pair) implements the full
slice across all layers. There are no handoffs between layers because one person owns the entire
thin slice from database to UI. This also spreads knowledge - developers who work across all
layers understand the full system, not just their specialty.
Impact on continuous delivery
Continuous delivery requires a continuous flow of small, independently deployable changes.
Horizontal slicing produces the opposite: a batch of interdependent layer changes that can only
be deployed together after a separate integration phase.
A team that slices horizontally cannot deploy continuously because there is nothing to deploy
until all layers converge. They cannot get production feedback because nothing user-visible exists
until the end. They cannot limit risk because the first real test of the integrated system happens
after all the work is done.
The pipeline itself becomes less useful. When changes are horizontal slices, the pipeline can only
verify that one layer works in isolation - it cannot run meaningful end-to-end tests until all
layers exist. The pipeline gives a false green signal (“the API tests pass”) that hides the real
question (“does the feature work?”).
How to Fix It
Step 1: Learn to recognize horizontal slices (Week 1)
Before changing how the team slices, build awareness. Review the current sprint board and backlog.
For each work item, ask:
- Can a user (or another system) observe the change after this item is deployed?
- Can I write an end-to-end test for this item alone?
- Does this item deliver value without waiting for other items to be completed?
If the answer to any of these is no, the item is likely a horizontal slice. Tag these items and
count them. Most teams discover that a majority of their backlog is horizontally sliced.
Step 2: Reslice one feature vertically (Week 2)
Pick one upcoming feature and practice reslicing it. Start with the current horizontal breakdown
and convert it:
Before (horizontal):
- Create the database tables for notifications
- Build the notification API endpoints
- Build the notification preferences UI
- Integration testing for notifications
After (vertical):
- User receives an email notification when their order ships (DB + API + email + minimal UI)
- User can view notification history on their profile page
- User can disable email notifications for order updates
- User can choose between email and SMS for shipping notifications
Each vertical slice is independently deployable and testable end-to-end. Each delivers something
a user can see. The team gets feedback after item 1 instead of after item 4.
Step 3: Use the deployability test in refinement (Week 3+)
Make the deployability test a standard part of backlog refinement. For every proposed work item,
ask: “If this were the only thing we shipped this sprint, would a user notice?”
If not, the item needs reslicing. This single question catches most horizontal slices before they
enter the sprint.
Complement this with concrete acceptance criteria in Given-When-Then format. Each scenario should
describe observable behavior, not technical implementation:
- Good: “Given a registered user, when they update their email, then a verification link is sent
to the new address”
- Bad: “Build the email verification API endpoint”
Step 4: Break the specialist habit (Week 4+)
Horizontal slicing and specialist assignment reinforce each other. As long as “the backend
developer does the backend work,” slicing by layer feels natural.
Break this cycle:
- Have developers work full-stack on vertical slices. A developer who implements the entire
slice - database, API, and UI - will naturally slice vertically because they own the full
delivery.
- Pair a specialist with a generalist. If a developer is uncomfortable with a particular
layer, pair them with someone who knows it. This builds cross-layer skills while delivering
vertical slices.
- Rotate who works on what. Do not let the same person always take the database items. When
anyone can work on any layer, the team stops organizing work by layer.
Step 5: Address the objections
| Objection | Response |
| --- | --- |
| “Our developers are specialists - they can’t work across layers” | That is a skill gap, not a constraint. Pairing a frontend developer with a backend developer on a vertical slice builds the missing skills while delivering the work. The short-term slowdown produces long-term flexibility. |
| “The database schema needs to be designed holistically” | Design the schema incrementally. Add the columns and tables needed for the first slice. Extend them for the second. This is how trunk-based database evolution works - backward-compatible, incremental changes. Designing the “complete” schema upfront leads to speculative design that changes anyway. |
| “Vertical slices create duplicate work across layers” | They create less total work because integration problems are caught immediately instead of accumulating. The “duplicate” concern usually means the team is building more infrastructure than the current slice requires. Build only what the current slice needs. |
| “Some work is genuinely infrastructure” | True infrastructure work (setting up a new database, provisioning a service) still needs to be connected to a user outcome. “Provision the notification service and send one test notification” is a vertical slice that includes the infrastructure. |
| “Our architecture makes vertical slicing hard” | That is a signal about the architecture. Tightly coupled layers that cannot be changed independently are a deployment risk. Vertical slicing exposes this coupling early, which is better than discovering it during a high-stakes integration phase. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Percentage of work items that are independently deployable | Should increase toward 100% |
| Time from feature start to first production deploy | Should decrease as the first vertical slice ships early |
| Development cycle time | Should decrease as items no longer wait for other layers |
| Integration issues discovered late | Should decrease as integration happens with every slice |
| Integration frequency | Should increase as deployable slices are completed and merged daily |
1.1.4 - Too Much Work in Progress
Every developer is on a different story. Eight items in progress, zero done. Nothing gets the focused attention needed to finish.
Category: Team Workflow | Quality Impact: High
What This Looks Like
Open the team’s board on any given day. Count the items in progress. Now count the team members.
If the first number is significantly higher than the second, the team has a WIP problem.
Common variations:
- Everyone on a different story. A team of five has eight or more stories in progress. Nobody
is working on the same thing. The board is a wide river of half-finished work.
- Sprint-start explosion. On the first day of the sprint, every developer pulls a story. By
mid-sprint, all stories are “in progress” and none are “done.” The last day is a scramble to
close anything.
- Individual WIP hoarding. A single developer has three stories assigned: one they’re actively
coding, one waiting for review, and one blocked on a question. They count all three as “in
progress” and start nothing new - but they also don’t help anyone else finish.
- Hidden WIP. The board shows five items in progress, but each developer is also investigating
a production bug, answering questions about a previous story, and prototyping something for next
sprint. Unofficial work doesn’t appear on the board but consumes the same attention.
- Expedite as default. Urgent requests arrive mid-sprint. Instead of replacing existing work,
they stack on top. WIP grows because nothing is removed when something is added.
The telltale sign: the team is busy all the time but finishes very little. Stories take longer and
longer to complete. The sprint ends with a pile of items at 80% done.
Why This Is a Problem
High WIP is not a sign of a productive team. It is a sign of a team that has optimized for
starting work instead of finishing it. The consequences compound over time.
It destroys focus and increases context switching
Every item in progress competes for a developer’s attention. A developer working on one story can
focus deeply. A developer juggling three stories - one active, one waiting for review, one they
need to answer questions about - is constantly switching context. Research consistently shows that
each additional concurrent task reduces productive time by 20-40%.
The switching cost is not just time. It is cognitive load. Developers lose their mental model of
the code when they switch away, and it takes 15-30 minutes to rebuild it when they switch back.
Multiply this across five context switches per day and the team is spending more time reloading
context than writing code.
In a low-WIP environment, developers finish one thing before starting the next. Deep focus is the
default. Context switching is the exception, not the rule.
It inflates cycle time
Little’s Law is not a suggestion. It is a mathematical relationship: cycle time equals work in
progress divided by throughput. If a team’s throughput is roughly constant (and over weeks, it is),
the only way to reduce cycle time is to reduce WIP.
A team of five with a throughput of ten stories per sprint and five stories in progress has an
average cycle time of half a sprint. The same team with fifteen stories in progress has an average
cycle time of 1.5 sprints. The work is not getting done faster because more of it was started. It
is getting done slower because all of it is competing for the same capacity.
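The relationship is easy to sanity-check with your own numbers. A minimal sketch, using the figures from this example:

```python
# Little's Law: average cycle time = average WIP / average throughput.
# The numbers below mirror the example in the text; substitute your own.
def average_cycle_time(wip: float, throughput_per_sprint: float) -> float:
    """Average cycle time in sprints for a given WIP and throughput."""
    return wip / throughput_per_sprint


print(average_cycle_time(wip=5, throughput_per_sprint=10))   # 0.5 sprints
print(average_cycle_time(wip=15, throughput_per_sprint=10))  # 1.5 sprints
```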
Long cycle times create their own problems. Feedback is delayed. Requirements go stale.
Integration conflicts accumulate. The longer a story sits in progress, the more likely it is to
need rework when it finally reaches review or testing.
It hides bottlenecks
When WIP is high, bottlenecks are invisible. If code reviews are slow, a developer just starts
another story while they wait. If the test environment is broken, they work on something else. The
constraint is never confronted because there is always more work to absorb the slack.
This is comfortable but destructive. The bottleneck does not go away because the team is working
around it. It quietly degrades the system. Reviews pile up. Test environments stay broken. The
team’s real throughput is constrained by the bottleneck, but nobody feels the pain because they
are always busy.
When WIP is limited, bottlenecks become immediately visible. A developer who cannot start new work
because the WIP limit is reached has to swarm on something blocked. “I’m idle because my PR has
been waiting for review for two hours” is a problem the team can solve. “I just started another
story while I wait” hides the same problem indefinitely.
It prevents swarming and collaboration
When every developer has their own work in progress, there is no incentive to help anyone else.
Reviewing a teammate’s pull request, pairing on a stuck story, or helping debug a failing test all
feel like distractions from “my work.” The result is that every item moves through the pipeline
alone, at the pace of a single developer.
Swarming - multiple team members working together to finish the highest-priority item - is
impossible when everyone has their own stories to protect. If you ask a developer to drop their
current story and help finish someone else’s, you are asking them to fall behind on their own work.
The incentive structure is broken.
In a low-WIP environment, finishing the team’s most important item is everyone’s job. When only
three items are in progress for a team of five, two people are available to pair, review, or
unblock. Collaboration is the natural state, not a special request.
Impact on continuous delivery
Continuous delivery requires a steady flow of small, finished changes moving through the pipeline.
High WIP produces the opposite: a large batch of unfinished changes sitting in various stages of
completion, blocking each other, accumulating merge conflicts, and stalling in review queues.
A team with fifteen items in progress does not deploy fifteen times as often as a team with one
item in progress. They deploy less frequently because nothing is fully done. Each “almost done”
story is a small batch that has not yet reached the pipeline. The batch keeps growing until
something forces a reckoning - usually the end of the sprint.
The feedback loop breaks too. When changes sit in progress for days, the developer who wrote the
code has moved on by the time the review comes back or the test fails. They have to reload context
to address feedback, which takes more time, which delays the next change, which increases WIP
further. The cycle reinforces itself.
How to Fix It
Step 1: Make WIP visible (Week 1)
Before setting any limits, make the current state impossible to ignore.
- Count every item currently in progress for the team. Include stories, bugs, spikes, and any
unofficial work that is consuming attention.
- Write this number on the board. Update it daily.
- Most teams are shocked. A team of five typically discovers 12-20 items in progress once hidden
work is included.
Do not try to fix anything yet. The goal is awareness.
Step 2: Set an initial WIP limit (Week 2)
Use the N+2 formula as a starting point, where N is the number of team members actively
working on delivery.
| Team size | Starting WIP limit | Why |
| --- | --- | --- |
| 3 developers | 5 items | One per person plus a buffer for blocked items |
| 5 developers | 7 items | Same ratio |
| 8 developers | 10 items | Buffer shrinks proportionally |
Add the limit to the board as a column header or policy: “In Progress (limit: 7).” Agree as a
team that when the limit is reached, nobody starts new work.
Step 3: Enforce the limit with swarming (Week 3+)
When the WIP limit is reached and a developer finishes something, they have two options:
- Pull the next highest-priority item if the WIP count is below the limit.
- Swarm on an existing item if the WIP count is at the limit.
Swarming means pairing on a stuck story, reviewing a pull request, writing a test someone needs
help with, or resolving a blocker. The key behavior change: “I have nothing to do” is never the
right response. “What can I help finish?” is.
Step 4: Lower the limit over time (Monthly)
The initial limit is a starting point. Each month, consider reducing it by one.
| Limit | What it exposes |
| --- | --- |
| N+2 | Gross overcommitment. Most teams find this is already a significant reduction. |
| N+1 | Slow reviews, environment contention, unclear requirements. Team starts swarming. |
| N | Every person on one thing. Blocked items get immediate attention. |
| Below N | Team is pairing by default. Cycle time drops sharply. |
Each reduction will feel uncomfortable. That discomfort is the point - it exposes constraints in
the workflow that were hidden by excess WIP.
Step 5: Address the objections
Expect resistance and prepare for it:
| Objection | Response |
| --- | --- |
| “I’ll be idle if I can’t start new work” | Idle hands are not the problem. Idle work is. Help finish something instead of starting something new. |
| “Management will see people not typing and think we’re wasting time” | Track cycle time and throughput. When both improve, the data speaks for itself. |
| “We have too many priorities to limit WIP” | Having many priorities is exactly why you need a WIP limit. Without one, nothing gets the focus needed to finish. Everything is “in progress,” nothing is done. |
| “What about urgent production issues?” | Keep one expedite slot. If a production issue arrives, it takes the slot. If the slot is full, the new issue replaces the current one. Expedite is not a way to bypass the limit - it is part of the limit. |
| “Our stories are too big to pair on” | That is a separate problem. See Work Decomposition. Stories should be small enough that anyone can pick them up. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Work in progress | Should stay at or below the team’s limit |
| Development cycle time | Should decrease as WIP drops |
| Stories completed per week | Should stabilize or increase despite starting fewer items |
| Time items spend blocked | Should decrease as the team swarms on blockers |
| Sprint-end scramble | Should disappear as work finishes continuously through the sprint |
1.1.5 - Push-Based Work Assignment
Work is assigned to individuals by a manager or lead instead of team members pulling the next highest-priority item.
Category: Team Workflow | Quality Impact: High
What This Looks Like
A manager, tech lead, or project manager decides who works on what. Assignments happen during
sprint planning, in one-on-ones, or through tickets pre-assigned before the sprint starts. Each
team member has “their” stories for the sprint. The assignment is rarely questioned.
Common variations:
- Assignment by specialty. “You’re the database person, so you take the database stories.” Work
is routed by perceived expertise rather than team priority.
- Assignment by availability. A manager looks at who is “free” and assigns the next item from
the backlog, regardless of what the team needs finished.
- Assignment by seniority. Senior developers get the interesting or high-priority work. Junior
developers get what’s left.
- Pre-loaded sprints. Every team member enters the sprint with their work already assigned. The
sprint board is fully allocated on day one.
The telltale sign: if you ask a developer “what should you work on next?” and the answer is “I
don’t know, I need to ask my manager,” work is being pushed.
Why This Is a Problem
Push-based assignment is one of the most quietly destructive practices a team can have. It
undermines nearly every CD practice by breaking the connection between the team and the flow of
work. Each of its effects compounds the others.
It reduces quality
Push assignment makes code review feel like a distraction from “my stories.” When every developer
has their own assigned work, reviewing someone else’s pull request is time spent not making progress
on your own assignment. Reviews sit for hours or days because the reviewer is busy with their own
work. The same dynamic discourages pairing: spending an hour helping a colleague means falling
behind on your own assignments, so developers don’t offer and don’t ask.
This means fewer eyes on every change. Defects that a second person would catch in minutes survive
into production. Knowledge stays siloed because there is no reason to look at code outside your
assignment. The team’s collective understanding of the codebase narrows over time.
In a pull system, reviewing code and unblocking teammates are the highest-priority activities
because finishing the team’s work is everyone’s work. Reviews happen quickly because they are not
competing with “my stories” - they are the work. Pairing happens naturally because anyone might
pick up any story, and asking for help is how the team moves its highest-priority item forward.
It increases rework
Push assignment routes work by specialty: “You’re the database person, so you take the database
stories.” This creates knowledge silos where only one person understands a part of the system.
When the same person always works on the same area, mistakes go unreviewed by anyone with a fresh
perspective. Assumptions go unchallenged because the reviewer lacks context to question them.
Misinterpretation of requirements also increases. The assigned developer may not have context on why
a story is high priority or what business outcome it serves - they received it as an assignment, not
as a problem to solve. When the result doesn’t match what was needed, the story comes back for
rework.
In a pull system, anyone might pick up any story, so knowledge spreads across the team. Fresh eyes
catch assumptions that a domain expert would miss. Developers who pull a story engage with its
priority and purpose because they chose it from the top of the backlog. Rework drops because more
perspectives are involved earlier.
It makes delivery timelines unpredictable
Push assignment optimizes for utilization - keeping everyone busy - not for flow - getting things
done. Every developer has their own assigned work, so team WIP is the sum of all individual
assignments. There is no mechanism to say “we have too much in progress, let’s finish something
first.” WIP limits become meaningless when the person assigning work doesn’t see the full picture.
Bottlenecks are invisible because the manager assigns around them instead of surfacing them. If one
area of the system is a constraint, the assigner may not notice because they are looking at people,
not flow. In a pull system, the bottleneck becomes obvious: work piles up in one column and nobody
pulls it because the downstream step is full.
Workloads are uneven because managers cannot perfectly predict how long work will take. Some people
finish early and sit idle or start low-priority work, while others are overloaded. Feedback loops
are slow because the order of work is decided at sprint planning; if priorities change mid-sprint,
the manager must reassign. Throughput becomes erratic - some sprints deliver a lot, others very
little, with no clear pattern.
In a pull system, workloads self-balance: whoever finishes first pulls the next item. Bottlenecks
are visible. WIP limits actually work because the team collectively decides what to start. The team
automatically adapts to priority changes because the next person who finishes simply pulls whatever
is now most important.
It removes team ownership
Pull systems create shared ownership of the backlog. The team collectively cares about the priority
order because they are collectively responsible for finishing work. Push systems create individual
ownership: “that’s not my story.” When a developer finishes their assigned work, they wait for more
assignments instead of looking at what the team needs.
This extends beyond task selection. In a push system, developers stop thinking about the team’s
goals and start thinking about their own assignments. Swarming - multiple people collaborating to
finish the highest-priority item - is impossible when everyone “has their own stuff.” If a story is
stuck, the assigned developer struggles alone while teammates work on their own assignments.
The unavailability problem makes this worse. When each person works in isolation on “their” stories,
the rest of the team has no context on what that person is doing, how the work is structured, or
what decisions have been made. If the assigned person is out sick, on vacation, or leaves the
company, nobody can pick up where they left off. The work either stalls until that person returns or
another developer starts over - rereading requirements, reverse-engineering half-finished code, and
rediscovering decisions that were never shared. In a pull system, the team maintains context on
in-progress work because anyone might have pulled it, standups focus on the work rather than
individual status, and pairing spreads knowledge continuously. When someone is unavailable, the
next person simply picks up the item with enough shared context to continue.
Impact on continuous delivery
Continuous delivery depends on a steady, predictable flow of small changes through the pipeline.
Push-based assignment produces the opposite: batch-based assignment at sprint planning, uneven
bursts of activity as different developers finish at different times, blocked work sitting idle
because the assigned person is busy with something else, and no team-level mechanism for optimizing
throughput. You cannot build a continuous flow of work when the assignment model is batch-based and
individually scoped.
How to Fix It
Step 1: Order the backlog by priority (Week 1)
Before switching to a pull model, the backlog must have a clear priority order. Without it,
developers will not know what to pull next.
- Work with the product owner to stack-rank the backlog. Every item has a unique position - no
tied priorities.
- Make the priority visible. The top of the board or backlog is the most important item. There
is no ambiguity.
- Agree as a team: when you need work, you pull from the top.
Step 2: Stop pre-assigning work in sprint planning (Week 2)
Change the sprint planning conversation. Instead of “who takes this story,” the team:
- Pulls items from the top of the prioritized backlog into the sprint.
- Discusses each item enough for anyone on the team to start it.
- Leaves all items unassigned.
The sprint begins with a list of prioritized work and no assignments. This will feel uncomfortable
for the first sprint.
Step 3: Pull work daily (Week 2+)
At the daily standup (or anytime during the day), a developer who needs work:
- Looks at the sprint board.
- Checks if any in-progress item needs help (swarm first, pull second).
- If nothing needs help and the WIP limit allows, pulls the top unassigned item and assigns
themselves.
The developer picks up the highest-priority available item, not the item that matches their
specialty. This is intentional - it spreads knowledge, reduces bus factor, and keeps the team
focused on priority rather than comfort.
Step 4: Address the discomfort (Weeks 3-4)
Expect these objections and plan for them:
| Objection | Response |
| --- | --- |
| “But only Sarah knows the payment system” | That is a knowledge silo and a risk. Pairing Sarah with someone else on payment stories fixes the silo while delivering the work. |
| “I assigned work because nobody was pulling it” | If nobody pulls high-priority work, that is a signal: either the team doesn’t understand the priority, the item is poorly defined, or there is a skill gap. Assignment hides the signal instead of addressing it. |
| “Some developers are faster - I need to assign strategically” | Pull systems self-balance. Faster developers pull more items. Slower developers finish fewer but are never overloaded. The team throughput optimizes naturally. |
| “Management expects me to know who’s working on what” | The board shows who is working on what in real time. Pull systems provide more visibility than pre-assignment because assignments are always current, not a stale plan from sprint planning. |
Step 5: Combine with WIP limits (Week 4+)
Pull-based work and WIP limits reinforce each other:
- WIP limits prevent the team from pulling too much work at once.
- Pull-based assignment ensures that when someone finishes, they pull the next priority - not
whatever the manager thinks of next.
- Together, they create a system where work flows continuously from backlog to done.
See Limiting WIP for how to set and enforce WIP limits.
What managers do instead
Moving to a pull model does not eliminate the need for leadership. It changes the focus:
| Push model (before) | Pull model (after) |
| --- | --- |
| Decide who works on what | Ensure the backlog is prioritized and refined |
| Balance workloads manually | Coach the team on swarming and collaboration |
| Track individual assignments | Track flow metrics (cycle time, WIP, throughput) |
| Reassign work when priorities change | Update backlog priority and let the team adapt |
| Manage individual utilization | Remove systemic blockers the team cannot resolve |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Percentage of stories pre-assigned at sprint start | Should drop to near zero |
| Work in progress | Should decrease as team focuses on finishing |
| Development cycle time | Should decrease as swarming increases |
| Stories completed per sprint | Should stabilize or increase despite less “busyness” |
| Rework rate | Stories returned for rework or reopened after completion - should decrease |
| Knowledge distribution | Track who works on which parts of the system - should broaden over time |
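The last metric, knowledge distribution, can be approximated from version-control history by counting distinct commit authors per area of the codebase. A minimal sketch, assuming a git repository; the directory names and the 90-day window are placeholders:

```python
# Sketch: approximate knowledge distribution by counting distinct commit authors
# per area of the codebase over the last 90 days. The directory list and time
# window are placeholders - adapt them to your repository layout.
import subprocess


def authors_by_area(paths, since="90 days ago"):
    result = {}
    for path in paths:
        log = subprocess.run(
            ["git", "log", f"--since={since}", "--format=%an", "--", path],
            capture_output=True, text=True, check=True,
        ).stdout
        result[path] = {name for name in log.splitlines() if name.strip()}
    return result


if __name__ == "__main__":
    for area, authors in authors_by_area(["src/api", "src/ui", "src/db"]).items():
        print(f"{area}: {len(authors)} distinct authors in the last 90 days")
```

If one area consistently shows a single author, that is the silo to break first.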
1.2 - Branching and Integration
Anti-patterns in how teams branch, merge, and integrate code that prevent continuous integration and delivery.
These anti-patterns affect how code flows from a developer’s machine to the shared trunk. They
create painful merges, delayed integration, and broken builds that prevent the steady stream of
small, verified changes that continuous delivery requires.
1.2.1 - Long-Lived Feature Branches
Branches that live for weeks or months, turning merging into a project in itself. The longer the branch, the bigger the risk.
Category: Branching & Integration | Quality Impact: Critical
What This Looks Like
A developer creates a branch to build a feature. The feature is bigger than expected. Days pass,
then weeks. Other developers are doing the same thing on their own branches. Trunk moves forward
while each branch diverges further from it. Nobody integrates until the feature is “done” - and
by then, the branch is hundreds or thousands of lines different from where it started.
When the merge finally happens, it is an event. The developer sets aside half a day - sometimes
more - to resolve conflicts, re-test, and fix the subtle breakages that come from combining weeks
of divergent work. Other developers delay their merges to avoid the chaos. The team’s Slack channel
lights up with “don’t merge right now, I’m resolving conflicts.” Every merge creates a window where
trunk is unstable.
Common variations:
- The “feature branch” that is really a project. A branch named `feature/new-checkout` that
  lasts three months. Multiple developers commit to it. It has its own bug fixes and its own
  merge conflicts. It is a parallel fork of the product.
- The “I’ll merge when it’s ready” branch. The developer views the branch as a private workspace.
Merging to trunk is the last step, not a daily practice. The branch falls further behind each day
but the developer does not notice until merge day.
- The per-sprint branch. Each sprint gets a branch. All sprint work goes there. The branch is
merged at sprint end and a new one is created. Integration happens every two weeks instead of
every day.
- The release isolation branch. A branch is created weeks before a release to “stabilize” it.
Bug fixes must be applied to both the release branch and trunk. Developers maintain two streams
of work simultaneously.
- The “too risky to merge” branch. The branch has diverged so far that nobody wants to attempt
the merge. It sits for weeks while the team debates how to proceed. Sometimes it is abandoned
entirely and the work is restarted.
The telltale sign: if merging a branch requires scheduling a block of time, notifying the team, or
hoping nothing goes wrong - branches are living too long.
Why This Is a Problem
Long-lived feature branches appear safe. Each developer works in isolation, free from interference.
But that isolation is precisely the problem. It delays integration, hides conflicts, and creates
compounding risk that makes every aspect of delivery harder.
It reduces quality
When a branch lives for weeks, code review becomes a formidable task. The reviewer faces hundreds
of changed lines across dozens of files. Meaningful review is nearly impossible at that scale -
studies consistently show that review effectiveness drops sharply after 200-400 lines of change.
Reviewers skim, approve, and hope for the best. Subtle bugs, design problems, and missed edge
cases survive because nobody can hold the full changeset in their head.
The isolation also means developers make decisions in a vacuum. Two developers on separate branches
may solve the same problem differently, introduce duplicate abstractions, or make contradictory
assumptions about shared code. These conflicts are invisible until merge time, when they surface as
bugs rather than design discussions.
With short-lived branches or trunk-based development, changes are small enough for genuine review.
A 50-line change gets careful attention. Design disagreements surface within hours, not weeks. The
team maintains a shared understanding of how the codebase is evolving because they see every change
as it happens.
It increases rework
Long-lived branches guarantee merge conflicts. Two developers editing the same file on different
branches will not discover the collision until one of them merges. The second developer must then
reconcile their changes against an unfamiliar modification, often without understanding the intent
behind it. This manual reconciliation is rework in its purest form - effort spent making code work
together that would have been unnecessary if the developers had integrated daily.
The rework compounds. A developer who rebases a three-week branch against trunk may introduce
bugs during conflict resolution. Those bugs require debugging. The debugging reveals an assumption
that was valid three weeks ago but is no longer true because trunk has changed. Now the developer
must rethink and partially rewrite their approach. What should have been a day of work becomes a
week.
When developers integrate daily, conflicts are small - typically a few lines. They are resolved in
minutes with full context because both changes are fresh. The cost of integration stays constant
rather than growing exponentially with branch age.
It makes delivery timelines unpredictable
A two-day feature on a long-lived branch takes two days to build and an unknown number of days
to merge. The merge might take an hour. It might take two days. It might surface a design conflict
that requires reworking the feature. Nobody knows until they try. This makes it impossible to
predict when work will actually be done.
The queuing effect makes it worse. When several branches need to merge, they form a queue. The
first merge changes trunk, which means the second branch needs to rebase against the new trunk
before merging. If the second merge is large, it changes trunk again, and the third branch must
rebase. Each merge invalidates the work done to prepare the next one. Teams that “schedule” their
merges are admitting that integration is so costly it needs coordination.
Project managers learn they cannot trust estimates. “The feature is code-complete” does not mean
it is done - it means the merge has not started yet. Stakeholders lose confidence in the team’s
ability to deliver on time because “done” and “deployed” are separated by an unpredictable gap.
With continuous integration, there is no merge queue. Each developer integrates small changes
throughout the day. The time from “code-complete” to “integrated and tested” is minutes, not days.
Delivery dates become predictable because the integration cost is near zero.
It hides risk until the worst possible moment
Long-lived branches create an illusion of progress. The team has five features “in development,”
each on its own branch. The features appear to be independent and on track. But the risk is
hidden: none of these features have been proven to work together. The branches may contain
conflicting changes, incompatible assumptions, or integration bugs that only surface when combined.
All of that hidden risk materializes at merge time - the moment closest to the planned release
date, when the team has the least time to deal with it. A merge conflict discovered three weeks
before release is an inconvenience. A merge conflict discovered the day before release is a crisis.
Long-lived branches systematically push risk discovery to the latest possible point.
Continuous integration surfaces risk immediately. If two changes conflict, the team discovers it
within hours, while both changes are small and the authors still have full context. Risk is
distributed evenly across the development cycle instead of concentrated at the end.
Impact on continuous delivery
Continuous delivery requires that trunk is always in a deployable state and that any commit can be
released at any time. Long-lived feature branches make both impossible. Trunk cannot be deployable
if large, poorly validated merges land periodically and destabilize it. You cannot release any commit
if the latest commit is a 2,000-line merge that has not been fully tested.
Long-lived branches also prevent continuous integration - the practice of integrating every
developer’s work into trunk at least once per day. Without continuous integration, there is no
continuous delivery. The pipeline cannot provide fast feedback on changes that exist only on
private branches. The team cannot practice deploying small changes because there are no small
changes - only large merges separated by days or weeks of silence.
Every other CD practice - automated testing, pipeline automation, small batches, fast feedback -
is undermined when the branching model prevents frequent integration.
How to Fix It
Step 1: Measure your current branch lifetimes (Week 1)
Before changing anything, understand the baseline. For every open branch:
- Record when it was created and when (or if) it was last merged.
- Calculate the age in days.
- Note the number of changed files and lines.
Most teams are shocked by their own numbers. A branch they think of as “a few days old” is often
two or three weeks old. Making the data visible creates urgency.
Set a target: no branch older than one day. This will feel aggressive. That is the point.
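If the repository is in git, a small script can produce the baseline report. The sketch below is a starting point under stated assumptions, not a definitive tool: it treats the commit where each branch diverged from trunk as its creation date, and it assumes the trunk is `origin/main` and that `git fetch` has already run.

```python
# Minimal sketch: estimate each remote branch's age as days since it diverged from trunk.
import subprocess
import time

TRUNK = "origin/main"  # assumption - adjust to your trunk branch name

def git(*args: str) -> str:
    return subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout.strip()

def age_in_days(branch: str) -> int:
    fork_point = git("merge-base", TRUNK, branch)                    # commit where the branch diverged
    forked_at = int(git("show", "-s", "--format=%ct", fork_point))   # committer date as a unix timestamp
    return int((time.time() - forked_at) // 86400)

branches = [
    ref
    for ref in git("for-each-ref", "--format=%(refname:short)", "refs/remotes/origin").splitlines()
    if ref not in (TRUNK, "origin/HEAD")
]
ages = {branch: age_in_days(branch) for branch in branches}
for branch, days in sorted(ages.items(), key=lambda item: item[1], reverse=True):
    print(f"{days:>4} days  {branch}")
```

Run it in a scheduled job or paste the output into the team channel - the point is to make the numbers visible, not to build tooling.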
Step 2: Set a branch lifetime limit and make it visible (Week 2)
Agree as a team on a maximum branch lifetime. Start with two days if one day feels too aggressive.
The important thing is to pick a number and enforce it.
Make the limit visible:
- Add a dashboard or report that shows branch age for every open branch.
- Flag any branch that exceeds the limit in the daily standup.
- If your CI tool supports it, add a check that warns when a branch exceeds 24 hours.
The limit creates a forcing function. Developers must either integrate quickly or break their work
into smaller pieces. Both outcomes are desirable.
Step 3: Break large features into small, integrable changes (Weeks 2-3)
The most common objection is “my feature is too big to merge in a day.” This is true when the
feature is designed as a monolithic unit. The fix is decomposition:
- Branch by abstraction. Introduce a new code path alongside the old one. Merge the new code
path in small increments. Switch over when ready.
- Feature flags. Hide incomplete work behind a toggle so it can be merged to trunk without
being visible to users.
- Keystone interface pattern. Build all the back-end work first, merge it incrementally, and
add the UI entry point last. The feature is invisible until the keystone is placed.
- Vertical slices. Deliver the feature as a series of thin, user-visible increments instead of
building all layers at once.
Each technique lets developers merge daily without exposing incomplete functionality. The feature
grows incrementally on trunk rather than in isolation on a branch.
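As one illustration, hiding incomplete work behind a flag can be as simple as a conditional around the new code path. This is a minimal sketch, not a recommendation of any particular flag tool, and every name in it is invented for the example.

```python
import os

def checkout_v1(cart: list[float]) -> float:
    # Existing behavior - still the default for every user.
    return sum(cart)

def checkout_v2(cart: list[float]) -> float:
    # New path under construction. It merges to trunk in small increments
    # but stays invisible until the flag is turned on.
    return round(sum(cart), 2)

def new_checkout_enabled() -> bool:
    # Read from an environment variable here; a real team might use a
    # configuration file or a flag service instead.
    return os.environ.get("NEW_CHECKOUT_ENABLED", "false").lower() == "true"

def checkout(cart: list[float]) -> float:
    return checkout_v2(cart) if new_checkout_enabled() else checkout_v1(cart)
```

Because the flag defaults to off, partially built code can land on trunk every day without changing what users see.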
Step 4: Adopt short-lived branches with daily integration (Weeks 3-4)
Change the team’s workflow:
- Create a branch from trunk.
- Make a small, focused change.
- Get a quick review (the change is small, so review takes minutes).
- Merge to trunk. Delete the branch.
- Repeat.
Each branch lives for hours, not days. If a branch cannot be merged by end of day, it is too
large. The developer should either merge what they have (using one of the decomposition techniques
above) or discard the branch and start smaller tomorrow.
Pair this with the team’s code review practice. Small changes enable fast reviews, and fast reviews
enable short-lived branches. The two practices reinforce each other.
Step 5: Address the objections (Weeks 3-4)
| Objection | Response |
| --- | --- |
| “My feature takes three weeks - I can’t merge in a day” | The feature takes three weeks. The branch does not have to. Use branch by abstraction, feature flags, or vertical slicing to merge daily while the feature grows incrementally on trunk. |
| “Merging incomplete code to trunk is dangerous” | Incomplete code behind a feature flag or without a UI entry point is not dangerous - it is invisible. The danger is a three-week branch that lands as a single untested merge. |
| “I need my branch to keep my work separate from other changes” | That separation is the problem. You want to discover conflicts early, when they are small and cheap to fix. A branch that hides conflicts for three weeks is not protecting you - it is accumulating risk. |
| “We tried short-lived branches and it was chaos” | Short-lived branches require supporting practices: feature flags, good decomposition, fast CI, and a culture of small changes. Without those supports, it will feel chaotic. The fix is to build the supports, not to retreat to long-lived branches. |
| “Code review takes too long for daily merges” | Small changes take minutes to review, not hours. If reviews are slow, that is a review process problem, not a branching problem. See PR Review Bottlenecks. |
Step 6: Continuously tighten the limit (Week 5+)
Once the team is comfortable with two-day branches, reduce the limit to one day. Then push toward
integrating multiple times per day. Each reduction surfaces new problems - features that are hard
to decompose, tests that are slow, reviews that are bottlenecked - and each problem is worth
solving because it blocks the flow of work.
The goal is continuous integration: every developer integrates to trunk at least once per day.
At that point, “branches” are just short-lived workspaces that exist for hours, and merging is
a non-event.
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Average branch lifetime | Should decrease to under one day |
| Maximum branch lifetime | No branch should exceed two days |
| Integration frequency | Should increase toward at least daily per developer |
| Merge conflict frequency | Should decrease as branches get shorter |
| Merge duration | Should decrease from hours to minutes |
| Development cycle time | Should decrease as integration overhead drops |
| Lines changed per merge | Should decrease as changes get smaller |
Related Content
1.2.2 - No Continuous Integration
The build has been red for weeks and nobody cares. “CI” means a build server exists, not that anyone actually integrates continuously.
Category: Branching & Integration | Quality Impact: Critical
What This Looks Like
The team has a build server. It runs after every push. There is a dashboard somewhere that shows
build status. But the build has been red for three weeks and nobody has mentioned it. Developers
push code, glance at the result if they remember, and move on. When someone finally investigates,
the failure is in a test that broke weeks ago and nobody can remember which commit caused it.
The word “continuous” has lost its meaning. Developers do not integrate their work into trunk
daily - they work on branches for days or weeks and merge when the feature feels done. The build
server runs, but nobody treats a red build as something that must be fixed immediately. There is no
shared agreement that trunk should always be green. “CI” is a tool in the infrastructure, not a
practice the team follows.
Common variations:
- The build server with no standards. A CI server runs on every push, but there are no rules
about what happens when it fails. Some developers fix their failures. Others do not. The build
flickers between green and red all day, and nobody trusts the signal.
- The nightly build. The build runs once per day, overnight. Developers find out the next
morning whether yesterday’s work broke something. By then they have moved on to new work and
lost context on what they changed.
- The “CI” that is just compilation. The build server compiles the code and nothing else. No
tests run. No static analysis. The build is green as long as the code compiles, which tells the
team almost nothing about whether the software works.
- The manually triggered build. The build server exists, but it does not run on push. After
pushing code, the developer must log into the CI server and manually start the build and tests.
When developers are busy or forget, their changes sit untested. When multiple pushes happen
between triggers, a failure could belong to any of them. The feedback loop depends entirely on
developer discipline rather than automation.
- The branch-only build. CI runs on feature branches but not on trunk. Each branch builds in
isolation, but nobody knows whether the branches work together until merge day. Trunk is not
continuously validated.
- The ignored dashboard. The CI dashboard exists but is not displayed anywhere the team can
see it. Nobody checks it unless they are personally waiting for a result. Failures accumulate
silently.
The telltale sign: if you can ask “how long has the build been red?” and nobody knows the answer,
continuous integration is not happening.
Why This Is a Problem
Continuous integration is not a tool - it is a practice. The practice requires that every developer
integrates to a shared trunk at least once per day and that the team treats a broken build as the
highest-priority problem. Without the practice, the build server is just infrastructure generating
notifications that nobody reads.
It reduces quality
When the build is allowed to stay red, the team loses its only automated signal that something is
wrong. A passing build is supposed to mean “the software works as tested.” A failing build is
supposed to mean “stop and fix this before doing anything else.” When failures are ignored, that
signal becomes meaningless. Developers learn that a red build is background noise, not an alarm.
Once the build signal is untrusted, defects accumulate. A developer introduces a bug on Monday. The
build fails, but it was already red from an unrelated failure, so nobody notices. Another developer
introduces a different bug on Tuesday. By Friday, trunk has multiple interacting defects and nobody
knows when they were introduced or by whom. Debugging becomes archaeology.
When the team practices continuous integration, a red build is rare and immediately actionable. The
developer who broke it knows exactly which change caused the failure because they committed minutes
ago. The fix is fast because the context is fresh. Defects are caught individually, not in tangled
clusters.
It increases rework
Without continuous integration, developers work in isolation for days or weeks. Each developer
assumes their code works because it passes on their machine or their branch. But they are building
on assumptions about shared code that may already be outdated. When they finally integrate, they
discover that someone else changed an API they depend on, renamed a class they import, or modified
behavior they rely on.
The rework cascade is predictable. Developer A changes a shared interface on Monday. Developer B
builds three days of work on the old interface. On Thursday, developer B tries to integrate and
discovers the conflict. Now they must rewrite three days of code to match the new interface. If
they had integrated on Monday, the conflict would have been a five-minute fix.
Teams that integrate continuously discover conflicts within hours, not days. The rework is measured
in minutes because the conflicting changes are small and the developers still have full context on
both sides. The total cost of integration stays low and constant instead of spiking unpredictably.
It makes delivery timelines unpredictable
A team without continuous integration cannot answer the question “is the software releasable right
now?” Trunk may or may not compile. Tests may or may not pass. The last successful build may have
been a week ago. Between then and now, dozens of changes have landed without anyone verifying that
they work together.
This creates a stabilization period before every release. The team stops feature work, fixes the
build, runs the test suite, and triages failures. This stabilization takes an unpredictable amount
of time - sometimes a day, sometimes a week - because nobody knows how many problems have
accumulated since the last known-good state.
With continuous integration, trunk is always in a known state. If the build is green, the team can
release. If the build is red, the team knows exactly which commit broke it and how long ago. There
is no stabilization period because the code is continuously stabilized. Release readiness is a
fact that can be checked at any moment, not a state that must be achieved through a dedicated
effort.
It masks the true cost of integration problems
When the build is permanently broken or rarely checked, the team cannot see the patterns that would
tell them where their process is failing. Is the build slow? Nobody notices because nobody waits
for it. Are certain tests flaky? Nobody notices because failures are expected. Do certain parts of
the codebase cause more breakage than others? Nobody notices because nobody correlates failures to
changes.
These hidden problems compound. The build gets slower because nobody is motivated to speed it up.
Flaky tests multiply because nobody quarantines them. Brittle areas of the codebase stay brittle
because the feedback that would highlight them is lost in the noise.
When the team practices CI and treats a red build as an emergency, every friction point becomes
visible. A slow build annoys the whole team daily, creating pressure to optimize it. A flaky test
blocks everyone, creating pressure to fix or remove it. The practice surfaces the problems. Without
the practice, the problems are invisible and grow unchecked.
Impact on continuous delivery
Continuous integration is the foundation that every other CD practice is built on. Without it, the
pipeline cannot give fast, reliable feedback on every change. Automated testing is pointless if
nobody acts on the results. Deployment automation is pointless if the artifact being deployed has
not been validated. Small batches are pointless if the batches are never verified to work together.
A team that does not practice CI cannot practice CD. The two are not independent capabilities that
can be adopted in any order. CI is the prerequisite. Every hour that the build stays red is an
hour during which the team has no automated confidence that the software works. Continuous delivery
requires that confidence to exist at all times.
How to Fix It
Step 1: Fix the build and agree it stays green (Week 1)
Before anything else, get trunk to green. This is the team’s first and most important commitment.
- Assign the broken build as the highest-priority work item. Stop feature work if necessary.
- Triage every failure: fix it, quarantine it to a non-blocking suite, or delete the test if it
provides no value.
- Once the build is green, make the team agreement explicit: a red build is the team’s top
priority. Whoever broke it fixes it. If they cannot fix it within 15 minutes, they revert
their change and try again with a smaller commit.
Write this agreement down. Put it in the team’s working agreements document. If you do not have
one, start one now. The agreement is simple: we do not commit on top of a red build, and we do not
leave a red build for someone else to fix.
Step 2: Make the build visible (Week 1)
The build status must be impossible to ignore:
- Display the build dashboard on a large monitor visible to the whole team.
- Configure notifications so that a broken build alerts the team immediately - in the team chat
channel, not in individual email inboxes.
- If the build breaks, the notification should identify the commit and the committer.
Visibility creates accountability. When the whole team can see that the build broke at 2:15 PM
and who broke it, social pressure keeps people attentive. When failures are buried in email
notifications, they are easily ignored.
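A hedged sketch of the notification itself, assuming a chat tool that accepts Slack-style incoming-webhook posts and a CI system that exposes the commit and author as environment variables - the variable names below are placeholders to adapt to your pipeline:

```python
import os
import requests

WEBHOOK_URL = os.environ["TEAM_CHAT_WEBHOOK"]             # assumption: stored as a CI secret
commit = os.environ.get("CI_COMMIT_SHA", "unknown")[:8]   # placeholder variable name
author = os.environ.get("CI_COMMIT_AUTHOR", "unknown")    # placeholder variable name

# Post a short, actionable message to the team channel when the build breaks.
requests.post(
    WEBHOOK_URL,
    json={"text": f"Build broken at {commit} by {author} - fix or revert now."},
    timeout=10,
)
```

Wire it into whatever failure hook your CI system provides so the message lands in the team channel rather than an inbox.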
Step 3: Require integration at least once per day (Week 2)
The “continuous” in continuous integration means at least daily, and ideally multiple times per day.
Set the expectation:
- Every developer integrates their work to trunk at least once per day.
- If a developer has been working on a branch for more than a day without integrating, that is a
problem to discuss at standup.
- Track integration frequency per developer per day. Make it visible alongside the build dashboard.
This will expose problems. Some developers will say their work is not ready to integrate. That is a
decomposition problem - the work is too large. Some will say they cannot integrate because the build
is too slow. That is a pipeline problem. Each problem is worth solving. See
Long-Lived Feature Branches for techniques to break large work
into daily integrations.
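Integration frequency can be read straight from history. A minimal sketch, assuming trunk is `origin/main` and that commits landing on trunk are a reasonable proxy for integrations:

```python
import subprocess
from collections import Counter

log = subprocess.run(
    ["git", "log", "origin/main", "--since=14.days", "--date=short", "--format=%ad %an"],
    capture_output=True, text=True, check=True,
).stdout

# One line per commit on trunk: "YYYY-MM-DD Author Name"
per_day_and_author = Counter(line for line in log.splitlines() if line.strip())
for day_author, count in sorted(per_day_and_author.items()):
    print(f"{count:>3} integrations  {day_author}")
```

If the team integrates via merge or squash commits, counting only merge commits (adding `--merges`) may be a closer proxy.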
Step 4: Make the build fast enough to provide useful feedback (Weeks 2-3)
A build that takes 45 minutes is a build that developers will not wait for. Target under 10
minutes for the primary feedback loop:
- Identify the slowest stages and optimize or parallelize them.
- Move slow integration tests to a secondary pipeline that runs after the fast suite passes.
- Add build caching so that unchanged dependencies are not recompiled on every run.
- Run tests in parallel if they are not already.
The goal is a fast feedback loop: the developer pushes, waits a few minutes, and knows whether
their change works with everything else. If they have to wait 30 minutes, they will context-switch,
and the feedback loop breaks.
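One concrete way to split the fast and slow suites, sketched here with pytest as an example framework. The `slow` marker name is our own convention and would need to be registered in the pytest configuration to avoid warnings.

```python
import pytest

@pytest.mark.slow
def test_full_checkout_journey_against_integration_environment():
    # Stand-in for a slow integration or end-to-end check; it runs in the
    # secondary pipeline, not in the primary feedback loop.
    assert True

def test_order_total_is_sum_of_line_items():
    # Fast, isolated check that stays in the primary feedback loop.
    assert sum([2.0, 3.0]) == 5.0
```

The primary pipeline then runs `pytest -m "not slow"` for fast feedback, and the secondary pipeline runs `pytest -m slow` afterward.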
Step 5: Address the objections (Weeks 3-4)
| Objection | Response |
| --- | --- |
| “The build is too slow to fix every red immediately” | Then the build is too slow, and that is a separate problem to solve. A slow build is not a reason to ignore failures - it is a reason to invest in making the build faster. |
| “Some tests are flaky - we can’t treat every failure as real” | Quarantine flaky tests into a non-blocking suite. The blocking suite must be deterministic. If a test in the blocking suite fails, it is real until proven otherwise. |
| “We can’t integrate daily - our features take weeks” | The features take weeks. The integrations do not have to. Use branch by abstraction, feature flags, or vertical slicing to integrate partial work daily. |
| “Fixing someone else’s broken build is not my job” | It is the whole team’s job. A red build blocks everyone. If the person who broke it is unavailable, someone else should revert or fix it. The team owns the build, not the individual. |
| “We have CI - the build server runs on every push” | A build server is not CI. CI is the practice of integrating frequently and keeping the build green. If the build has been red for a week, you have a build server, not continuous integration. |
Step 6: Build the habit (Week 4+)
Continuous integration is a daily discipline, not a one-time setup. Reinforce the habit:
- Review integration frequency in retrospectives. If it is dropping, ask why.
- Celebrate streaks of consecutive green builds. Make it a point of team pride.
- When a developer reverts a broken commit quickly, recognize it as the right behavior - not as a
failure.
- Periodically audit the build: is it still fast? Are new flaky tests creeping in? Is the test
coverage meaningful?
The goal is a team culture where a red build feels wrong - like an alarm that demands immediate
attention. When that instinct is in place, CI is no longer a process being followed. It is how
the team works.
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Build pass rate | Percentage of builds that pass on first run - should be above 95% |
| Time to fix a broken build | Should be under 15 minutes, with revert as the fallback |
| Integration frequency | At least one integration per developer per day |
| Build duration | Should be under 10 minutes for the primary feedback loop |
| Longest period with a red build | Should be measured in minutes, not hours or days |
| Development cycle time | Should decrease as integration overhead drops and stabilization periods disappear |
Related Content
1.3 - Testing
Anti-patterns in test strategy, test architecture, and quality practices that block continuous delivery.
These anti-patterns affect how teams build confidence that their code is safe to deploy. They
create slow pipelines, flaky feedback, and manual gates that prevent the continuous flow of
changes to production.
1.3.1 - No Test Automation
Zero automated tests. The team has no idea where to start and the codebase was not designed for testability.
Category: Testing & Quality | Quality Impact: Critical
What This Looks Like
The team deploys by manually verifying things work. Someone clicks through the application, checks
a few screens, and declares it good. There is no test suite. No test runner configured. No test
directory in the repository. The CI server, if one exists, builds the code and stops there.
When a developer asks “how do I know if my change broke something?” the answer is either “you
don’t” or “someone from QA will check it.” Bugs discovered in production are treated as inevitable.
Nobody connects the lack of automated tests to the frequency of production incidents because there
is no baseline to compare against.
Common variations:
- Tests exist but are never run. Someone wrote tests a year ago. The test suite is broken and
nobody has fixed it. The tests are checked into the repository but are not part of any pipeline
or workflow.
- Manual test scripts as the safety net. A spreadsheet or wiki page lists hundreds of manual
test cases. Before each release, someone walks through them by hand. The process takes days. It
is the only verification the team has.
- Testing is someone else’s job. Developers write code. A separate QA team tests it days or
weeks later. The feedback loop is so long that developers have moved on to other work by the
time defects are found.
- “The code is too legacy to test.” The team has decided the codebase is untestable.
Functions are thousands of lines long, everything depends on global state, and there are no
seams where test doubles could be inserted. This belief becomes self-fulfilling - nobody tries
because everyone agrees it is impossible.
The telltale sign: when a developer makes a change, the only way to verify it works is to deploy
it and see what happens.
Why This Is a Problem
Without automated tests, every change is a leap of faith. The team has no fast, reliable way to
know whether code works before it reaches users. Every downstream practice that depends on
confidence in the code - continuous integration, automated deployment, frequent releases - is
blocked.
It reduces quality
When there are no automated tests, defects are caught by humans or by users. Humans are slow,
inconsistent, and unable to check everything. A manual tester cannot verify 500 behaviors in an
hour, but an automated suite can. The behaviors that are not checked are the ones that break.
Developers writing code without tests have no feedback on whether their logic is correct until
someone else exercises it. A function that handles an edge case incorrectly will not be caught
until a user hits that edge case in production. By then, the developer has moved on and lost
context on the code they wrote.
With even a basic suite of automated tests, developers get feedback in minutes. They catch their
own mistakes while the code is fresh. The suite runs the same checks every time, never forgetting
an edge case and never getting tired.
It increases rework
Without tests, rework comes from two directions. First, bugs that reach production must be
investigated, diagnosed, and fixed - work that an automated test would have prevented. Second,
developers are afraid to change existing code because they have no way to verify they have not
broken something. This fear leads to workarounds: copy-pasting code instead of refactoring,
adding conditional branches instead of restructuring, and building new modules alongside old ones
instead of modifying what exists.
Over time, the codebase becomes a patchwork of workarounds layered on workarounds. Each change
takes longer because the code is harder to understand and more fragile. The absence of tests is
not just a testing problem - it is a design problem that compounds with every change.
Teams with automated tests refactor confidently. They rename functions, extract modules, and
simplify logic knowing that the test suite will catch regressions. The codebase stays clean
because changing it is safe.
It makes delivery timelines unpredictable
Without automated tests, the time between “code complete” and “deployed” is dominated by manual
verification. How long that verification takes depends on how many changes are in the batch, how
available the testers are, and how many defects they find. None of these variables are predictable.
A change that a developer finishes on Monday might not be verified until Thursday. If defects are
found, the cycle restarts. Lead time from commit to production is measured in weeks, and the
variance is enormous. Some changes take three days, others take three weeks, and the team cannot
predict which.
Automated tests collapse the verification step to minutes. The time from “code complete” to
“verified” becomes a constant, not a variable. Lead time becomes predictable because the largest
source of variance has been removed.
Impact on continuous delivery
Automated tests are the foundation of continuous delivery. Without them, there is no automated
quality gate. Without an automated quality gate, there is no safe way to deploy frequently.
Without frequent deployment, there is no fast feedback from production. Every CD practice assumes
that the team can verify code quality automatically. A team with no test automation is not on a
slow path to CD - they have not started.
How to Fix It
Starting test automation on an untested codebase feels overwhelming. The key is to start small,
establish the habit, and expand coverage incrementally. You do not need to test everything before
you get value - you need to test something and keep going.
Step 1: Set up the test infrastructure (Week 1)
Before writing a single test, make it trivially easy to run tests:
- Choose a test framework for your primary language. Pick the most popular one - do not
deliberate.
- Add the framework to the project. Configure it. Write a single test that asserts
  `true == true` and verify it passes.
- Add a test script or command to the project so that anyone can run the suite with a single
  command (e.g., `npm test`, `pytest`, `mvn test`).
- Add the test command to the CI pipeline so that tests run on every push.
The goal for week one is not coverage. It is infrastructure: a working test runner in the pipeline
that the team can build on.
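Using pytest as an example framework, the whole of week one’s “coverage” can be this single file, committed alongside the pipeline change that runs it:

```python
# tests/test_wiring.py - proves the test runner is installed, discovered,
# and visible in CI. It gets replaced by real tests immediately.
def test_the_test_runner_is_wired_up():
    assert True
```

Running `pytest` locally and seeing the same result in the pipeline is the week-one definition of done.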
Step 2: Write tests for every new change (Week 2+)
Establish a team rule: every new change must include at least one automated test. Not “every new
feature” - every change. Bug fixes get a regression test that fails without the fix and passes
with it. New functions get a test that verifies the core behavior. Refactoring gets a test that
pins the existing behavior before changing it.
This rule is more important than retroactive coverage. New code enters the codebase tested. The
tested portion grows with every commit. After a few months, the most actively changed code has
coverage, which is exactly where coverage matters most.
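A hedged illustration of the rule for bug fixes, with an invented function and an invented bug:

```python
def parse_quantity(raw: str) -> int:
    # Bug fix: the previous version crashed on surrounding whitespace ("  3 ").
    return int(raw.strip())

def test_parse_quantity_accepts_surrounding_whitespace():
    # Fails against the pre-fix code, passes with the fix - and now guards
    # against the bug ever returning.
    assert parse_quantity("  3 ") == 3
```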
Step 3: Target high-change areas for retroactive coverage (Weeks 3-6)
Use your version control history to find the files that change most often. These are the files
where bugs are most likely and where tests provide the most value:
- List the 10 files with the most commits in the last six months (see the sketch after this list).
- For each file, write tests for its core public behavior. Do not try to test every line - test
the functions that other code depends on.
- If the code is hard to test because of tight coupling, wrap it. Create a thin adapter around
the untestable code and test the adapter. This is the Strangler Fig pattern applied to testing.
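The first item can be scripted directly from version control. A minimal sketch, assuming it runs from the repository root:

```python
import subprocess
from collections import Counter

log = subprocess.run(
    ["git", "log", "--since=6.months", "--name-only", "--format="],
    capture_output=True, text=True, check=True,
).stdout

# With an empty --format, the output is just the changed file paths, one per line.
churn = Counter(path for path in log.splitlines() if path.strip())
for path, commits in churn.most_common(10):
    print(f"{commits:>4} commits  {path}")
```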
Step 4: Make untestable code testable incrementally (Weeks 4-8)
If the codebase resists testing, introduce seams one at a time:
| Problem | Technique |
| --- | --- |
| Function does too many things | Extract the pure logic into a separate function and test that |
| Hard-coded database calls | Introduce a repository interface, inject it, test with a fake |
| Global state or singletons | Pass dependencies as parameters instead of accessing globals |
| No dependency injection | Start with “poor man’s DI” - default parameters that can be overridden in tests |
You do not need to refactor the entire codebase. Each time you touch a file, leave it slightly
more testable than you found it.
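Two of the seams above, sketched together: pure logic extracted into its own function, and “poor man’s DI” via a default parameter so a test can substitute a fake. All names are illustrative.

```python
def bulk_discount(total: float) -> float:
    # Pure pricing logic pulled out of a larger method so it can be tested
    # directly, with no database or framework involved.
    return total * 0.9 if total >= 100 else total

class QuoteService:
    def __init__(self, price_rule=bulk_discount):
        # The default parameter keeps production call sites unchanged, while
        # a test can inject a different rule or a recording fake.
        self._price_rule = price_rule

    def quote(self, total: float) -> float:
        return self._price_rule(total)

def test_quote_applies_bulk_discount_at_threshold():
    assert QuoteService().quote(100) == 90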
Step 5: Set a coverage floor and ratchet it up (Week 6+)
Once you have meaningful coverage in actively changed code, set a coverage threshold in the
pipeline:
- Measure current coverage. Say it is 15%.
- Set the pipeline to fail if coverage drops below 15%.
- Every two weeks, raise the floor by 2-5 percentage points.
The floor prevents backsliding. The ratchet ensures progress. The team does not need to hit 90%
coverage - they need to ensure that coverage only goes up.
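If your coverage tool cannot enforce a threshold directly, a small pipeline step can. A minimal sketch, assuming a Cobertura-style `coverage.xml` whose root element carries a `line-rate` attribute (the format coverage.py’s `coverage xml` command emits):

```python
import sys
import xml.etree.ElementTree as ET

FLOOR = 15.0  # current floor in percent; raise it a few points every two weeks

rate = float(ET.parse("coverage.xml").getroot().attrib["line-rate"]) * 100
print(f"line coverage: {rate:.1f}% (floor: {FLOOR:.1f}%)")
sys.exit(0 if rate >= FLOOR else 1)
```

Many coverage tools have an equivalent built-in threshold option; prefer the built-in if yours does.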
| Objection | Response |
| --- | --- |
| “The codebase is too legacy to test” | You do not need to test the legacy code directly. Wrap it in testable adapters and test those. Every new change gets a test. Coverage grows from the edges inward. |
| “We don’t have time to write tests” | You are already spending that time on manual verification and production debugging. Tests shift that cost to the left where it is cheaper. Start with one test per change - the overhead is minutes, not hours. |
| “We need to test everything before it’s useful” | One test that catches one regression is more useful than zero tests. The value is immediate and cumulative. You do not need full coverage to start getting value. |
| “Developers don’t know how to write tests” | Pair a developer who has testing experience with one who does not. If nobody on the team has experience, invest one day in a testing workshop. The skill is learnable in a week. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Test count | Should increase every sprint |
| Code coverage of actively changed files | More meaningful than overall coverage - focus on files changed in the last 30 days |
| Build duration | Should increase slightly as tests are added, but stay under 10 minutes |
| Defects found in production vs. in tests | Ratio should shift toward tests over time |
| Change fail rate | Should decrease as test coverage catches regressions before deployment |
| Manual testing effort per release | Should decrease as automated tests replace manual verification |
Related Content
1.3.2 - Manual Regression Testing Gates
Every release requires days or weeks of manual testing. Testers execute scripted test cases. Test effort scales linearly with application size.
Category: Testing & Quality | Quality Impact: Critical
What This Looks Like
Before every release, the team enters a testing phase. Testers open a spreadsheet or test
management tool containing hundreds of scripted test cases. They walk through each one by hand:
click this button, enter this value, verify this result. The testing takes days. Sometimes it takes
weeks. Nothing ships until every case is marked pass or fail, and every failure is triaged.
Developers stop working on new features during this phase because testers need a stable build to
test against. Code freezes go into effect. Bug fixes discovered during testing must be applied
carefully to avoid invalidating tests that have already passed. The team enters a holding pattern
where the only work that matters is getting through the test cases.
The testing effort grows with every release. New features add new test cases, but old test cases
are rarely removed because nobody is confident they are redundant. A team that tested for three
days six months ago now tests for five. The spreadsheet has 800 rows. Every release takes longer
to validate than the last.
Common variations:
- The regression spreadsheet. A master spreadsheet of every test case the team has ever
written. Before each release, a tester works through every row. The spreadsheet is the
institutional memory of what the software is supposed to do, and nobody trusts anything else.
- The dedicated test phase. The sprint cadence is two weeks of development followed by one week
of testing. The test week is a mini-waterfall phase embedded in an otherwise agile process.
Nothing can ship until the test phase is complete.
- The test environment bottleneck. Manual testing requires a specific environment that is shared
across teams. The team must wait for their slot. When the environment is broken by another team’s
testing, everyone waits for it to be restored.
- The sign-off ceremony. A QA lead or manager must personally verify a subset of critical paths
and sign a document before the release can proceed. If that person is on vacation, the release
waits.
- The compliance-driven test cycle. Regulatory requirements are interpreted as requiring manual
execution of every test case with documented evidence. Each test run produces screenshots and
sign-off forms. The documentation takes as long as the testing itself.
The telltale sign: if the question “can we release today?” is always answered with “not until QA
finishes,” manual regression testing is gating your delivery.
Why This Is a Problem
Manual regression testing feels responsible. It feels thorough. But it creates a bottleneck that
grows worse with every feature the team builds, and the thoroughness it promises is an illusion.
It reduces quality
Manual testing is less reliable than it appears. A human executing the same test case for the
hundredth time will miss things. Attention drifts. Steps get skipped. Edge cases that seemed
important when the test was written get glossed over when the tester is on row 600 of a
spreadsheet. Studies on manual testing consistently show that testers miss 15-30% of defects
that are present in the software they are testing.
The test cases themselves decay. They were written for the version of the software that existed
when the feature shipped. As the product evolves, some cases become irrelevant, others become
incomplete, and nobody updates them systematically. The team is executing a test plan that
partially describes software that no longer exists.
The feedback delay compounds the quality problem. A developer who wrote code two weeks ago gets
a bug report from a tester during the regression cycle. The developer has lost context on the
change. They re-read their own code, try to remember what they were thinking, and fix the bug
with less confidence than they would have had the day they wrote it.
Automated tests catch the same classes of bugs in seconds, with perfect consistency, every time
the code changes. They do not get tired on row 600. They do not skip steps. They run against the
current version of the software, not a test plan written six months ago. And they give feedback
immediately, while the developer still has full context.
It increases rework
The manual testing gate creates a batch-and-queue cycle. Developers write code for two weeks, then
testers spend a week finding bugs in that code. Every bug found during the regression cycle is
rework: the developer must stop what they are doing, reload the context of a completed story,
diagnose the issue, fix it, and send it back to the tester for re-verification. The re-verification
may invalidate other test cases, requiring additional re-testing.
The batch size amplifies the rework. When two weeks of changes are tested together, a bug could be
in any of dozens of commits. Narrowing down the cause takes longer because there are more
variables. When the same bug would have been caught by an automated test minutes after it was
introduced, the developer would have fixed it in the same sitting - one context switch instead of
many.
The rework also affects testers. A bug fix during the regression cycle means the tester must re-run
affected test cases. If the fix changes behavior elsewhere, the tester must re-run those cases too.
A single bug fix can cascade into hours of re-testing, pushing the release date further out.
With automated regression tests, bugs are caught as they are introduced. The fix happens
immediately. There is no regression cycle, no re-testing cascade, and no context-switching penalty.
It makes delivery timelines unpredictable
The regression testing phase takes as long as it takes. The team cannot predict how many bugs the
testers will find, how long each fix will take, or how much re-testing the fixes will require. A
release planned for Friday might slip to the following Wednesday. Or the following Friday.
This unpredictability cascades through the organization. Product managers cannot commit to delivery
dates because they do not know how long testing will take. Stakeholders learn to pad their
expectations. “We’ll release in two weeks” really means “we’ll release in two to four weeks,
depending on what QA finds.”
The unpredictability also creates pressure to cut corners. When the release is already three days
late, the team faces a choice: re-test thoroughly after a late bug fix, or ship without full
re-testing. Under deadline pressure, most teams choose the latter. The manual testing gate that
was supposed to ensure quality becomes the reason quality is compromised.
Automated regression suites produce predictable, repeatable results. The suite runs in the same
amount of time every time. There is no testing phase to slip. The team knows within minutes of
every commit whether the software is releasable.
It creates a permanent scaling problem
Manual testing effort scales linearly with application size. Every new feature adds test cases.
The test suite never shrinks. A team that takes three days to test today will take four days in
six months and five days in a year. The testing phase consumes an ever-growing fraction of the
team’s capacity.
This scaling problem is invisible at first. Three days of testing feels manageable. But the growth
is relentless. The team that started with 200 test cases now has 800. The test phase that was two
days is now a week. And because the test cases were written by different people at different times,
nobody can confidently remove any of them without risking a missed regression.
Automated tests scale differently. Adding a new automated test adds milliseconds to the suite
duration, not hours to the testing phase. A team with 10,000 automated tests runs them in the same
10 minutes as a team with 1,000. The cost of confidence is fixed, not linear.
Impact on continuous delivery
Manual regression testing is fundamentally incompatible with continuous delivery. CD requires that
any commit can be released at any time. A manual testing gate that takes days means the team can
release at most once per testing cycle. If the gate takes a week, the team releases at most every
two or three weeks - regardless of how fast their pipeline is or how small their changes are.
The manual gate also breaks the feedback loop that CD depends on. CD gives developers confidence
that their change works by running automated checks within minutes. A manual gate replaces that
fast feedback with a slow, batched, human process that cannot keep up with the pace of development.
You cannot have continuous delivery with a manual regression gate. The two are mutually exclusive.
The gate must be automated before CD is possible.
How to Fix It
Step 1: Catalog your manual test cases and categorize them (Week 1)
Before automating anything, understand what the manual test suite actually covers. For every test
case in the regression suite:
- Identify what behavior it verifies.
- Classify it: is it testing business logic, a UI flow, an integration boundary, or a compliance
requirement?
- Rate its value: has this test ever caught a real bug? When was the last time?
- Rate its automation potential: can this be tested at a lower level (unit, functional, API)?
Most teams discover that a large percentage of their manual test cases are either redundant (the
same behavior is tested multiple times), outdated (the feature has changed), or automatable at a
lower level.
Step 2: Automate the highest-value cases first (Weeks 2-4)
Pick the 20 test cases that cover the most critical paths - the ones that would cause the most
damage if they regressed. Automate them:
- Business logic tests become unit tests.
- API behavior tests become functional tests.
- Critical user journeys become a small set of E2E smoke tests.
Do not try to automate everything at once. Start with the cases that give the most confidence per
minute of execution time. The goal is to build a fast automated suite that covers the riskiest
scenarios so the team no longer depends on manual execution for those paths.
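As an illustration, a scripted manual case such as “log in, open the orders page, verify orders are listed” might become an API-level functional test. The sketch below uses the `requests` library against placeholder endpoints; the URLs, payloads, and credentials are assumptions to adapt to your own system.

```python
import requests

BASE_URL = "https://staging.example.com"  # placeholder test environment

def test_orders_endpoint_lists_orders_for_a_known_user():
    session = requests.Session()

    # Step 1 of the manual script: log in as a known test user.
    login = session.post(
        f"{BASE_URL}/api/login",
        json={"user": "test-user", "password": "test-pass"},
        timeout=10,
    )
    assert login.status_code == 200

    # Step 2: the orders page's data source returns a list of orders.
    orders = session.get(f"{BASE_URL}/api/orders", timeout=10)
    assert orders.status_code == 200
    assert isinstance(orders.json(), list)
```

The same check that took a tester several minutes to click through now runs in seconds on every commit.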
Step 3: Run automated tests in the pipeline on every commit (Week 3)
Move the new automated tests into the CI pipeline so they run on every push. This is the critical
shift: testing moves from a phase at the end of development to a continuous activity that happens
with every change.
Every commit now gets immediate feedback on the critical paths. If a regression is introduced, the
developer knows within minutes - not weeks.
Step 4: Shrink the manual suite as automation grows (Weeks 4-8)
Each week, pick another batch of manual test cases and either automate or retire them:
- Automate cases where the behavior is stable and testable at a lower level.
- Retire cases that are redundant with existing automated tests or that test behavior that no
longer exists.
- Keep manual only for genuinely exploratory testing that requires human judgment - usability
evaluation, visual design review, or complex workflows that resist automation.
Track the shrinkage. If the manual suite had 800 cases and now has 400, that is progress. If the
manual testing phase took five days and now takes two, that is measurable improvement.
Step 5: Replace the testing phase with continuous testing (Weeks 6-8+)
The goal is to eliminate the dedicated testing phase entirely:
| Before | After |
| --- | --- |
| Code freeze before testing | No code freeze - trunk is always testable |
| Testers execute scripted cases | Automated suite runs on every commit |
| Bugs found days or weeks after coding | Bugs found minutes after coding |
| Testing phase blocks release | Release readiness checked automatically |
| QA sign-off required | Pipeline pass is the sign-off |
| Testers do manual regression | Testers do exploratory testing, write automated tests, and improve test infrastructure |
Step 6: Address the objections (Ongoing)
| Objection | Response |
| --- | --- |
| “Automated tests can’t catch everything a human can” | Correct. But humans cannot execute 800 test cases reliably in a day, and automated tests can. Automate the repeatable checks and free humans for the exploratory testing where their judgment adds value. |
| “We need manual testing for compliance” | Most compliance frameworks require evidence that testing was performed, not that humans performed it. Automated test reports with pass/fail results, timestamps, and traceability to requirements satisfy most audit requirements better than manual spreadsheets. Confirm with your compliance team. |
| “Our testers don’t know how to write automated tests” | Pair testers with developers. The tester contributes domain knowledge - what to test and why - while the developer contributes automation skills. Over time, the tester learns automation and the developer learns testing strategy. |
| “We can’t automate tests for our legacy system” | Start with new code. Every new feature gets automated tests. For legacy code, automate the most critical paths first and expand coverage as you touch each area. The legacy system does not need 100% automation overnight. |
| “What if we automate a test wrong and miss a real bug?” | Manual tests miss real bugs too - consistently. An automated test that is wrong can be fixed once and stays fixed. A manual tester who skips a step makes the same mistake next time. Automation is not perfect, but it is more reliable and more improvable than manual execution. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Manual test case count | Should decrease steadily as cases are automated or retired |
| Manual testing phase duration | Should shrink toward zero |
| Automated test count in pipeline | Should grow as manual cases are converted |
| Release frequency | Should increase as the manual gate shrinks |
| Development cycle time | Should decrease as the testing phase is eliminated |
| Time from code complete to release | Should converge toward pipeline duration, not testing phase duration |
Related Content
1.3.3 - Flaky Test Suites
Tests randomly pass or fail. Developers rerun the pipeline until it goes green. Nobody trusts the test suite to tell them anything useful.
Category: Testing & Quality | Quality Impact: High
What This Looks Like
A developer pushes a change. The pipeline fails. They look at the failure - it is a test they did
not touch, in a module they did not change. They click “rerun.” It passes. They merge.
This happens multiple times a day across the team. Nobody investigates failures on the first
occurrence because the odds favor flakiness over a real problem. When someone mentions a test
failure in standup, the first question is “did you rerun it?” not “what broke?”
Common variations:
- The nightly lottery. The full suite runs overnight. Every morning, a different random subset
of tests is red. Someone triages the failures, marks most as flaky, and the team moves on. Real
regressions hide in the noise.
- The retry-until-green pattern. The pipeline configuration automatically reruns failed tests
two or three times. If a test passes on any attempt, it counts as passed. The team considers
this solved. In reality, it masks failures and doubles or triples pipeline duration.
- The “known flaky” tag. Tests are annotated with a skip or known-flaky marker. The suite
ignores them. The list grows over time. Nobody goes back to fix them because they are out of
sight.
- Environment-dependent failures. Tests pass on developer machines but fail in CI, or pass in
CI but fail on Tuesdays. The failures correlate with shared test environments, time-of-day
load patterns, or external service availability.
- Test order dependency. Tests pass when run in a specific order but fail when run in
isolation or in a different sequence. Shared mutable state from one test leaks into another.
The telltale sign: the team has a shared understanding that the first pipeline failure “doesn’t
count.” Rerunning the pipeline is a routine step, not an exception.
Why This Is a Problem
Flaky tests are not a minor annoyance. They systematically destroy the value of the test suite by
making it impossible to distinguish signal from noise. A test suite that sometimes lies is worse
than no test suite at all, because it creates an illusion of safety.
It reduces quality
When tests fail randomly, developers stop trusting them. The rational response to a flaky suite
is to ignore failures - and that is exactly what happens. A developer whose pipeline fails three
times a week for reasons unrelated to their code learns to click “rerun” without reading the
error message.
This behavior is invisible most of the time. It becomes catastrophic when a real regression
happens. The test that catches the regression fails, the developer reruns because “it’s probably
flaky,” it passes on the second run because the flaky behavior went the other way, and the
regression ships to production. The test did its job, but the developer’s trained behavior
neutralized it.
In a suite with zero flaky tests, every failure demands investigation. Developers read the error,
find the cause, and fix it. Failures are rare and meaningful. The suite functions as a reliable
quality gate.
It increases rework
Flaky tests cause rework in two ways. First, developers spend time investigating failures that
turn out to be noise. A developer sees a test failure, spends 20 minutes reading the error and
reviewing their change, realizes the failure is unrelated, and reruns. Multiply this by every
developer on the team, multiple times per day.
Second, the retry-until-green pattern extends pipeline duration. A pipeline that should take 8
minutes takes 20 because failed tests are rerun automatically. Developers wait longer for
feedback, switch contexts more often, and lose time to every switch.
Teams with deterministic test suites waste zero time investigating flaky failures. Their pipeline
runs once, gives an answer, and the developer acts on it.
It makes delivery timelines unpredictable
A flaky suite introduces randomness into the delivery process. The same code, submitted twice,
might pass the pipeline on the first attempt or take three reruns. Lead time from commit to merge
varies not because of code quality but because of test noise.
When the team needs to ship urgently, flaky tests become a source of anxiety. “Will the pipeline
pass this time?” The team starts planning around the flakiness - running the pipeline early “in
case it fails,” avoiding changes late in the day because there might not be time for reruns. The
delivery process is shaped by the unreliability of the tests rather than by the quality of the
code.
Deterministic tests make delivery time a function of code quality alone. The pipeline is a
predictable step that takes the same amount of time every run. There are no surprises.
It normalizes ignoring failures
The most damaging effect of flaky tests is cultural. Once a team accepts that test failures are
often noise, the standard for investigating failures drops permanently. New team members learn
from day one that “you just rerun it.” The bar for adding a flaky test to the suite is low
because one more flaky test is barely noticeable when there are already dozens.
This normalization extends beyond tests. If the team tolerates unreliable automated checks, they
will tolerate unreliable monitoring, unreliable alerts, and unreliable deploys. Flaky tests teach
the team that automation is not trustworthy - exactly the opposite of what CD requires.
Impact on continuous delivery
Continuous delivery depends on automated quality gates that the team trusts completely. A flaky
suite is a quality gate with a broken lock - it looks like it is there, but it does not actually
stop anything. Developers bypass it by rerunning. Regressions pass through it by luck.
The pipeline must be a machine that answers one question with certainty: “Is this change safe to
deploy?” A flaky suite answers “probably, maybe, rerun and ask again.” That is not a foundation
you can build continuous delivery on.
How to Fix It
Step 1: Measure the flakiness (Week 1)
Before fixing anything, quantify the problem:
- Collect pipeline run data for the last 30 days. Count the number of runs that failed and were
rerun without code changes.
- Identify which specific tests failed across those reruns. Rank them by failure frequency.
- Calculate the pipeline pass rate: what percentage of first-attempt runs succeed?
This gives you a hit list and a baseline. If your first-attempt pass rate is 60%, you know 40% of
pipeline runs are wasted on flaky failures.
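A minimal sketch of this measurement, assuming you can export pipeline runs as records with a
commit id, an attempt number, and a result (the field names and data shape are illustrative, not
any particular CI tool's API):

```python
from collections import defaultdict

# Illustrative export of the last 30 days of pipeline runs:
# (commit_id, attempt_number, passed)
runs = [
    ("a1b2c3", 1, False),
    ("a1b2c3", 2, True),   # rerun with no code change
    ("d4e5f6", 1, True),
]

def first_attempt_pass_rate(runs):
    first_attempts = [r for r in runs if r[1] == 1]
    return sum(1 for r in first_attempts if r[2]) / len(first_attempts)

def commits_needing_reruns(runs):
    max_attempts = defaultdict(int)
    for commit, attempt, _ in runs:
        max_attempts[commit] = max(max_attempts[commit], attempt)
    return sum(1 for n in max_attempts.values() if n > 1)

print(f"First-attempt pass rate: {first_attempt_pass_rate(runs):.0%}")
print(f"Commits that needed reruns: {commits_needing_reruns(runs)}")
```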
Step 2: Quarantine the worst offenders (Week 1)
Take the top 10 flakiest tests and move them out of the pipeline-gating suite immediately. Do not
fix them yet - just remove them from the critical path.
- Move them to a separate test suite that runs on a schedule (nightly or hourly) but does not
block merges.
- Create a tracking issue for each quarantined test with its failure rate and the suspected cause.
This immediately improves pipeline reliability. The team sees fewer false failures on day one.
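If the suite runs on pytest, one way to implement the quarantine is a marker plus a small
collection hook; the marker name and the environment variable are assumptions, not a standard:

```python
# conftest.py - skip quarantined tests in the gating run, keep them in a
# scheduled job that sets RUN_QUARANTINE=1. Tag tests with
# @pytest.mark.quarantine and register the marker in pytest.ini.
import os
import pytest

def pytest_collection_modifyitems(config, items):
    if os.environ.get("RUN_QUARANTINE") == "1":
        return  # scheduled job: run everything, including quarantined tests
    skip = pytest.mark.skip(reason="quarantined: tracked as a flaky-test issue")
    for item in items:
        if "quarantine" in item.keywords:
            item.add_marker(skip)
```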
Step 3: Fix or replace quarantined tests (Weeks 2-4)
Work through the quarantined tests systematically. For each one, identify the root cause:
| Root cause | Fix |
| --- | --- |
| Shared mutable state (database, filesystem, cache) | Isolate test data. Each test creates and destroys its own state. Use transactions or test containers. |
| Timing dependencies (sleep, setTimeout, polling) | Replace time-based waits with event-based waits. Wait for a condition, not a duration. |
| Test order dependency | Ensure each test is self-contained. Run tests in random order to surface hidden dependencies. |
| External service dependency | Replace with a test double. Validate the double with a contract test. |
| Race conditions in async code | Use deterministic test patterns. Await promises. Avoid fire-and-forget in test code. |
| Resource contention (ports, files, shared environments) | Allocate unique resources per test. Use random ports. Use temp directories. |
For each quarantined test, either fix it and return it to the gating suite or replace it with a
deterministic lower-level test that covers the same behavior.
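For timing-dependent tests, the fix usually looks the same regardless of framework: wait for the
condition, bounded by a timeout, instead of sleeping for a fixed duration. A minimal helper, with
the example calls as placeholders:

```python
import time

def wait_for(condition, timeout=5.0, interval=0.05):
    """Poll a condition until it is truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(interval)
    raise TimeoutError("condition not met within timeout")

# Before (flaky and slow):      time.sleep(2); assert job.is_done()
# After (deterministic intent): wait_for(lambda: job.is_done())
```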
Step 4: Prevent new flaky tests from entering the suite (Week 3+)
Establish guardrails so the problem does not recur:
- Run new tests 10 times in CI before merging them. If any run fails, the test is flaky and must
be fixed before it enters the suite. A sketch of this check follows the list.
- Run the full suite in random order. This surfaces order-dependent tests immediately.
- Track the pipeline first-attempt pass rate as a team metric. Make it visible on a dashboard.
Set a target (e.g., 95%) and treat drops below the target as incidents.
- Add a team working agreement: flaky tests are treated as bugs with the same priority as
production defects.
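The repeat-run check for new tests can start as a short script in CI; which files count as "new"
is left to your pipeline, and the pytest invocation is an assumption about your test runner:

```python
# check_new_tests.py - run candidate test files repeatedly; any failure = flaky.
# Usage: python check_new_tests.py tests/test_new_feature.py
import subprocess
import sys

RUNS = 10

def main(paths):
    for i in range(1, RUNS + 1):
        result = subprocess.run(["pytest", "-q", *paths])
        if result.returncode != 0:
            print(f"Run {i}/{RUNS} failed: the test is flaky, fix it before merging.")
            sys.exit(1)
    print(f"All {RUNS} runs passed.")

if __name__ == "__main__":
    main(sys.argv[1:])
```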
Step 5: Eliminate automatic retries (Week 4+)
If the pipeline is configured to automatically retry failed tests, turn it off. Retries mask
flakiness instead of surfacing it. Once the quarantine and prevention steps are in place, the
suite should be reliable enough to run once.
If a test fails, it should mean something. Retries teach the team that failures are meaningless.
Expect these objections:
| Objection | Response |
| --- | --- |
| “Retries are fine - they handle transient issues” | Transient issues in a test suite are a symptom of external dependencies or shared state. Fix the root cause instead of papering over it with retries. |
| “We don’t have time to fix flaky tests” | Calculate the time the team spends rerunning pipelines and investigating false failures. It is almost always more than the time to fix the flaky tests. |
| “Some flakiness is inevitable with E2E tests” | That is an argument for fewer E2E tests, not for tolerating flakiness. Push the test down to a level where it can be deterministic. |
| “The flaky test sometimes catches real bugs” | A test that catches real bugs 5% of the time and false-alarms 20% of the time is a net negative. Replace it with a deterministic test that catches the same bugs 100% of the time. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Pipeline first-attempt pass rate | Should climb toward 95%+ |
| Number of quarantined tests | Should decrease to zero as tests are fixed or replaced |
| Pipeline reruns per week | Should drop to near zero |
| Build duration | Should decrease as retries are removed |
| Development cycle time | Should decrease as developers stop waiting for reruns |
| Developer trust survey | Ask quarterly: “Do you trust the test suite to catch real problems?” Answers should improve. |
Related Content
1.3.4 - Inverted Test Pyramid
Most tests are slow end-to-end or UI tests. Few unit tests. The test suite is slow, brittle, and expensive to maintain.
Category: Testing & Quality | Quality Impact: High
What This Looks Like
The team has tests, but the wrong kind. Running the full suite takes 30 minutes or more. Tests
fail randomly. Developers rerun the pipeline and hope for green. When a test fails, the first
question is “is that a real failure or a flaky test?” rather than “what did I break?”
Common variations:
- The ice cream cone. Most testing is manual. Below that, a large suite of end-to-end browser
tests. A handful of integration tests. Almost no unit tests. The manual testing takes days, the
E2E suite takes hours, and nothing runs fast enough to give developers feedback while they code.
- The E2E-first approach. The team believes end-to-end tests are “real” tests because they
test the “whole system.” Unit tests are dismissed as “not testing anything useful” because they
use mocks. The result is a suite of 500 Selenium tests that take 45 minutes and fail 10% of
the time.
- The integration test swamp. Every test boots a real database, calls real services, and
depends on shared test environments. Tests are slow because they set up and tear down heavy
infrastructure. They are flaky because they depend on network availability and shared mutable
state.
- The UI test obsession. The team writes tests exclusively through the UI layer. Business
logic that could be verified in milliseconds with a unit test is instead tested through a
full browser automation flow that takes seconds per assertion.
- The “we have coverage” illusion. Code coverage is high because the E2E tests exercise most
code paths. But the tests are so slow and brittle that developers do not run them locally. They
push code and wait 40 minutes to learn if it works. If a test fails, they assume it is flaky
and rerun.
The telltale sign: developers do not trust the test suite. They push code and go get coffee. When
tests fail, they rerun before investigating. When a test is red for days, nobody is alarmed.
Why This Is a Problem
An inverted test pyramid does not just slow the team down. It actively undermines every benefit
that testing is supposed to provide.
The suite is too slow to give useful feedback
The purpose of a test suite is to tell developers whether their change works - fast enough that
they can act on the feedback while they still have context. A suite that runs in seconds gives
feedback during development. A suite that runs in minutes gives feedback before the developer
moves on. A suite that runs in 30 or more minutes gives feedback after the developer has started
something else entirely.
When the suite takes 40 minutes, developers do not run it locally. They push to CI and
context-switch to a different task. When the result comes back, they have lost the mental model of the
code they changed. Investigating a failure takes longer because they have to re-read their own
code. Fixing the failure takes longer because they are now juggling two streams of work.
A well-structured suite - heavy on unit tests, light on E2E - runs in under 10 minutes. Developers
run it locally before pushing. Failures are caught while the code is still fresh. The feedback
loop is tight enough to support continuous integration.
Flaky tests destroy trust
End-to-end tests are inherently non-deterministic. They depend on network connectivity, shared
test environments, external service availability, browser rendering timing, and dozens of other
factors outside the developer’s control. A test that fails because a third-party API was slow for
200 milliseconds looks identical to a test that fails because the code is wrong.
When 10% of the suite fails randomly on any given run, developers learn to ignore failures. They
rerun the pipeline, and if it passes the second time, they assume the first failure was noise.
This behavior is rational given the incentives, but it is catastrophic for quality. Real failures
hide behind the noise. A test that detects a genuine regression gets rerun and ignored alongside
the flaky tests.
Unit tests and functional tests with test doubles are deterministic. They produce the same result
every time. When a deterministic test fails, the developer knows with certainty that they broke
something. There is no rerun. There is no “is that real?” The failure demands investigation.
Maintenance cost grows faster than value
End-to-end tests are expensive to write and expensive to maintain. A single E2E test typically
involves:
- Setting up test data across multiple services
- Navigating through UI flows with waits and retries
- Asserting on UI elements that change with every redesign
- Handling timeouts, race conditions, and flaky selectors
When a feature changes, every E2E test that touches that feature must be updated. A redesign of
the checkout page breaks 30 E2E tests even if the underlying behavior has not changed. The team
spends more time maintaining E2E tests than writing new features.
Unit tests are cheap to write and cheap to maintain. They test behavior, not UI layout. A function
that calculates a discount does not care whether the button is blue or green. When the discount
logic changes, one or two unit tests need updating - not thirty browser flows.
It couples your pipeline to external systems
When most of your tests are end-to-end or integration tests that hit real services, your ability
to deploy depends on every system in the chain being available and healthy. If the payment
provider’s sandbox is down, your pipeline fails. If the shared staging database is slow, your
tests time out. If another team deployed a breaking change to a shared service, your tests fail
even though your code is correct.
This is the opposite of what CD requires. Continuous delivery demands that your team can deploy
independently, at any time, regardless of the state of external systems. A test architecture
built on E2E tests makes your deployment hostage to every dependency in your ecosystem.
A suite built on unit tests, functional tests, and contract tests runs entirely within your
control. External dependencies are replaced with test doubles that are validated by contract
tests. Your pipeline can tell you “this change is safe to deploy” even if every external system
is offline.
Impact on continuous delivery
The inverted pyramid makes CD impossible in practice even if all the other pieces are in place.
The pipeline takes too long to support frequent integration. Flaky failures erode trust in the
automated quality gates. Developers bypass the tests or batch up changes to avoid the wait. The
team gravitates toward manual verification before deploying because they do not trust the
automated suite.
A team that deploys weekly with a 40-minute flaky suite cannot deploy daily without either fixing
the test architecture or abandoning automated quality gates. Neither option is acceptable.
Fixing the architecture is the only sustainable path.
How to Fix It
Inverting the pyramid does not mean deleting all your E2E tests and writing unit tests from
scratch. It means shifting the balance deliberately over time so that most confidence comes from
fast, deterministic tests and only a small amount comes from slow, non-deterministic ones.
Step 1: Audit your current test distribution (Week 1)
Count your tests by type and measure their characteristics:
| Test type | Count | Total duration | Flaky? | Requires external systems? |
| --- | --- | --- | --- | --- |
| Unit | ? | ? | ? | ? |
| Integration | ? | ? | ? | ? |
| Functional | ? | ? | ? | ? |
| E2E | ? | ? | ? | ? |
| Manual | ? | N/A | N/A | N/A |
Run the full suite three times. Note which tests fail intermittently. Record the total duration.
This is your baseline.
Step 2: Quarantine the flaky tests (Week 1)
Move every flaky test out of the pipeline-gating suite into a separate quarantine suite. This is
not deleting them - it is removing them from the critical path so that real failures are visible.
For each quarantined test, decide:
- Fix it if the behavior it tests is important and the flakiness has a solvable cause (timing
dependency, shared state, test order dependency).
- Replace it with a faster, deterministic test that covers the same behavior at a lower level.
- Delete it if the behavior is already covered by other tests or is not worth the maintenance
cost.
Target: zero flaky tests in the pipeline-gating suite by end of week.
Step 3: Push tests down the pyramid (Weeks 2-4)
For each E2E test in your suite, ask: “Can the behavior this test verifies be tested at a lower
level?”
Most of the time, the answer is yes. An E2E test that verifies “user can apply a discount code”
is actually testing three things:
- The discount calculation logic (testable with a unit test)
- The API endpoint that accepts the code (testable with a functional test)
- The UI flow for entering the code (testable with a component test)
Write the lower-level tests first. Once they exist and pass, the E2E test is redundant for gating
purposes. Move it to a post-deployment smoke suite or delete it.
Work through your E2E suite systematically, starting with the slowest and most flaky tests. Each
test you push down the pyramid makes the suite faster and more reliable.
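As an illustration of the first item in the discount example - the names and rules here are
invented, your domain code will differ - the calculation logic can be verified in milliseconds
with no browser and no network:

```python
# Hypothetical discount logic and its unit tests.
def apply_discount(total_cents: int, code: str) -> int:
    """Return the total after applying a discount code, in cents."""
    discounts = {"SAVE10": 0.10, "SAVE25": 0.25}
    rate = discounts.get(code, 0.0)
    return round(total_cents * (1 - rate))

def test_valid_code_reduces_total():
    assert apply_discount(10_000, "SAVE10") == 9_000

def test_unknown_code_is_ignored():
    assert apply_discount(10_000, "BOGUS") == 10_000
```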
Step 4: Replace external dependencies with test doubles (Weeks 2-4)
Identify every test that calls a real external service and replace the dependency:
| Dependency type | Test double approach |
| --- | --- |
| Database | In-memory database, testcontainers, or repository fakes |
| External HTTP API | HTTP stubs (WireMock, nock, MSW) |
| Message queue | In-memory fake or test spy |
| File storage | In-memory filesystem or temp directory |
| Third-party service | Stub that returns canned responses |
Validate your test doubles with contract tests that run asynchronously. This ensures your doubles
stay accurate without coupling your pipeline to external systems.
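One framework-agnostic way to do this is to depend on an interface and inject a stub in tests;
the payment example below is hypothetical, and the contract test that keeps the stub honest runs
in a separate, non-gating job:

```python
# Production code depends on an abstraction; tests inject a deterministic stub.
class PaymentClient:
    def charge(self, amount_cents: int, token: str) -> dict:
        raise NotImplementedError

class StubPaymentClient(PaymentClient):
    """Test double: canned responses, no network, same result every run."""
    def charge(self, amount_cents: int, token: str) -> dict:
        return {"status": "approved", "amount": amount_cents}

def checkout(client: PaymentClient, amount_cents: int, token: str) -> str:
    response = client.charge(amount_cents, token)
    return "ok" if response["status"] == "approved" else "declined"

def test_checkout_approves_payment():
    assert checkout(StubPaymentClient(), 4_200, "tok_test") == "ok"
```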
Step 5: Adopt the test-for-every-change rule (Ongoing)
New code should be tested at the lowest possible level. Establish the team norm:
- Every new function with logic gets a unit test.
- Every new API endpoint or integration boundary gets a functional test.
- E2E tests are only added for critical smoke paths - not for every feature.
- Every bug fix includes a regression test at the lowest level that catches the bug.
Over time, this rule shifts the pyramid naturally. New code enters the codebase with the right
test distribution even as the team works through the legacy E2E suite.
Step 6: Address the objections
| Objection | Response |
| --- | --- |
| “Unit tests with mocks don’t test anything real” | They test logic, which is where most bugs live. A discount calculation that returns the wrong number is a real bug whether it is caught by a unit test or an E2E test. The unit test catches it in milliseconds. The E2E test catches it in minutes - if it is not flaky that day. |
| “E2E tests catch integration bugs that unit tests miss” | Functional tests with test doubles catch most integration bugs. Contract tests catch the rest. The small number of integration bugs that only E2E can find do not justify a suite of hundreds of slow, flaky E2E tests. |
| “We can’t delete E2E tests - they’re our safety net” | They are a safety net with holes. Flaky tests miss real failures. Slow tests delay feedback. Replace them with faster, deterministic tests that actually catch bugs reliably, then keep a small E2E smoke suite for post-deployment verification. |
| “Our code is too tightly coupled to unit test” | That is an architecture problem, not a testing problem. Start by writing tests for new code and refactoring existing code as you touch it. Use the Strangler Fig pattern - wrap untestable code in a testable layer. |
| “We don’t have time to rewrite the test suite” | You are already paying the cost of the inverted pyramid in slow feedback, flaky builds, and manual verification. The fix is incremental: push one test down the pyramid each day. After a month, the suite is measurably faster and more reliable. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Test suite duration | Should decrease toward under 10 minutes |
| Flaky test count in gating suite | Should reach and stay at zero |
| Test distribution (unit : integration : E2E ratio) | Unit tests should be the largest category |
| Pipeline pass rate | Should increase as flaky tests are removed |
| Developers running tests locally | Should increase as the suite gets faster |
| External dependencies in gating tests | Should reach zero |
Related Content
1.4 - Pipeline and Infrastructure
Anti-patterns in build pipelines, deployment automation, and infrastructure management that block continuous delivery.
These anti-patterns affect the automated path from commit to production. They create manual steps,
slow feedback, and fragile deployments that prevent the reliable, repeatable delivery that
continuous delivery requires.
1.4.1 - No Pipeline Exists
Builds and deployments are manual processes. Someone runs a script on their laptop. There is no automated path from commit to production.
Category: Pipeline & Infrastructure | Quality Impact: Critical
What This Looks Like
Deploying to production requires a person. Someone opens a terminal, SSHs into a server, pulls the
latest code, runs a build command, and restarts a service. Or they download an artifact from a
shared drive, copy it to the right server, and run an install script. The steps live in a wiki page,
a shared document, or in someone’s head. Every deployment is a manual operation performed by
whoever knows the procedure.
There is no automation connecting a code commit to a running system. A developer finishes a feature,
pushes to the repository, and then a separate human process begins: someone must decide it is time
to deploy, gather the right artifacts, prepare the target environment, execute the deployment, and
verify that it worked. Each of these steps involves manual effort and human judgment.
The deployment procedure is a craft. Certain people are known for being “good at deploys.” New team
members are warned not to attempt deployments alone. When the person who knows the procedure is
unavailable, deployments wait. The team has learned to treat deployment as a risky, specialized
activity that requires care and experience.
Common variations:
- The deploy script on someone’s laptop. A shell script that automates some steps, but it lives
on one developer’s machine. Nobody else has it. When that developer is out, the team either waits
or reverse-engineers the procedure from the wiki.
- The manual checklist. A document with 30 steps: “SSH into server X, run this command, check
this log file, restart this service.” The checklist is usually out of date. Steps are missing or
in the wrong order. The person deploying adds corrections in the margins.
- The “only Dave can deploy” pattern. One person has the credentials, the knowledge, and the
muscle memory to deploy reliably. Deployments are scheduled around Dave’s availability. Dave
is a single point of failure and cannot take vacation during release weeks.
- The FTP deployment. Build artifacts are uploaded to a server via FTP, SCP, or a file share.
The person deploying must know which files go where, which config files to update, and which
services to restart. A missed file means a broken deployment.
- The manual build. There is no automated build at all. A developer runs the build command
locally, checks that it compiles, and copies the output to the deployment target. The build
that was tested is not necessarily the build that gets deployed.
The telltale sign: if deploying requires a specific person, a specific machine, or a specific
document that must be followed step by step, no pipeline exists.
Why This Is a Problem
The absence of a pipeline means every deployment is a unique event. No two deployments are
identical because human hands are involved in every step. This creates risk, waste, and
unpredictability that compound with every release.
It reduces quality
Without a pipeline, there is no enforced quality gate between a developer’s commit and production.
Tests may or may not be run before deploying. Static analysis may or may not be checked. The
artifact that reaches production may or may not be the same artifact that was tested. Every “may
or may not” is a gap where defects slip through.
Manual deployments also introduce their own defects. A step skipped in the checklist, a wrong
version of a config file, a service restarted in the wrong order - these are deployment bugs that
have nothing to do with the code. They are caused by the deployment process itself. The more manual
steps involved, the more opportunities for human error.
A pipeline eliminates both categories of risk. Every commit passes through the same automated
checks. The artifact that is tested is the artifact that is deployed. There are no skipped steps
because the steps are encoded in the pipeline definition and execute the same way every time.
It increases rework
Manual deployments are slow, so teams batch changes to reduce deployment frequency. Batching means
more changes per deployment. More changes means harder debugging when something goes wrong, because
any of dozens of commits could be the cause. The team spends hours bisecting changes to find the
one that broke production.
Failed manual deployments create their own rework. A deployment that goes wrong must be diagnosed,
rolled back (if rollback is even possible), and re-attempted. Each re-attempt burns time and
attention. If the deployment corrupted data or left the system in a partial state, the recovery
effort dwarfs the original deployment.
Rework also accumulates in the deployment procedure itself. Every deployment surfaces a new edge
case or a new prerequisite that was not in the checklist. Someone updates the wiki. The next
deployer reads the old version. The procedure is never quite right because manual procedures
cannot be versioned, tested, or reviewed the way code can.
With an automated pipeline, deployments are fast and repeatable. Small changes deploy individually.
Failed deployments are rolled back automatically. The pipeline definition is code - versioned,
reviewed, and tested like any other part of the system.
It makes delivery timelines unpredictable
A manual deployment takes an unpredictable amount of time. The optimistic case is 30 minutes. The
realistic case includes troubleshooting unexpected errors, waiting for the right person to be
available, and re-running steps that failed. A “quick deploy” can easily consume half a day.
The team cannot commit to release dates because the deployment itself is a variable. “We can deploy
on Tuesday” becomes “we can start the deployment on Tuesday, and we’ll know by Wednesday whether it
worked.” Stakeholders learn that deployment dates are approximate, not firm.
The unpredictability also limits deployment frequency. If each deployment takes hours of manual
effort and carries risk of failure, the team deploys as infrequently as possible. This increases
batch size, which increases risk, which makes deployments even more painful, which further
discourages frequent deployment. The team is trapped in a cycle where the lack of a pipeline makes
deployments costly, and costly deployments make the lack of a pipeline seem acceptable.
An automated pipeline makes deployment duration fixed and predictable. A deploy takes the same
amount of time whether it happens once a month or ten times a day. The cost per deployment drops
to near zero, removing the incentive to batch.
It concentrates knowledge in too few people
When deployment is manual, the knowledge of how to deploy lives in people rather than in code. The
team depends on specific individuals who know the servers, the credentials, the order of
operations, and the workarounds for known issues. These individuals become bottlenecks and single
points of failure.
When the deployment expert is unavailable - sick, on vacation, or has left the company - the team
is stuck. Someone else must reconstruct the deployment procedure from incomplete documentation and
trial and error. Deployments attempted by inexperienced team members fail at higher rates, which
reinforces the belief that only experts should deploy.
A pipeline encodes deployment knowledge in an executable definition that anyone can run. New team
members deploy on their first day by triggering the pipeline. The deployment expert’s knowledge is
preserved in code rather than in their head. The bus factor for deployments moves from one to the
entire team.
Impact on continuous delivery
Continuous delivery requires an automated, repeatable pipeline that can take any commit from trunk
and deliver it to production with confidence. Without a pipeline, none of this is possible. There
is no automation to repeat. There is no confidence that the process will work the same way twice.
There is no path from commit to production that does not require a human to drive it.
The pipeline is not an optimization of manual deployment. It is a prerequisite for CD. A team
without a pipeline cannot practice CD any more than a team without source control can practice
version management. The pipeline is the foundation. Everything else - automated testing, deployment
strategies, progressive rollouts, fast rollback - depends on it existing.
How to Fix It
Step 1: Document the current manual process exactly (Week 1)
Before automating, capture what the team actually does today. Have the person who deploys most
often write down every step in order:
- What commands do they run?
- What servers do they connect to?
- What credentials do they use?
- What checks do they perform before, during, and after?
- What do they do when something goes wrong?
This document is not the solution - it is the specification for the first version of the pipeline.
Every manual step will become an automated step.
Step 2: Automate the build (Week 2)
Start with the simplest piece: turning source code into a deployable artifact without manual
intervention.
- Choose a CI server (Jenkins, GitHub Actions, GitLab CI, CircleCI, or any tool that triggers on
commit).
- Configure it to check out the code and run the build command on every push to trunk.
- Store the build output as a versioned artifact.
At this point, the team has an automated build but still deploys manually. That is fine. The
pipeline will grow incrementally.
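The CI configuration itself is tool-specific, but the step it triggers can be a small,
version-controlled script. A rough sketch - the build command and artifact paths are placeholders
for whatever your project actually uses:

```python
# build.py - run by the CI server on every push to trunk.
# Produces an artifact named after the commit it was built from.
import pathlib
import shutil
import subprocess

def main():
    sha = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    # Placeholder build command - substitute your real one.
    subprocess.run(["make", "build"], check=True)

    artifacts = pathlib.Path("artifacts")
    artifacts.mkdir(exist_ok=True)
    # Placeholder output path - whatever your build produces.
    shutil.copy("dist/app.tar.gz", artifacts / f"app-{sha}.tar.gz")
    print(f"Built artifact app-{sha}.tar.gz")

if __name__ == "__main__":
    main()
```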
Step 3: Add automated tests to the build (Week 3)
If the team has any automated tests, add them to the pipeline so they run after the build
succeeds. If the team has no automated tests, add one. A single test that verifies the application
starts up is more valuable than zero tests.
The pipeline should now fail if the build fails or if any test fails. This is the first automated
quality gate. No artifact is produced unless the code compiles and the tests pass.
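That first test can be as small as "the application starts and answers a health check." A sketch -
the start command, port, and endpoint are placeholders:

```python
# test_startup.py - the first automated quality gate: does the app even start?
import subprocess
import time
import urllib.request

def test_application_starts_and_responds():
    proc = subprocess.Popen(["python", "app.py"])  # placeholder start command
    try:
        deadline = time.monotonic() + 30
        while time.monotonic() < deadline:
            try:
                with urllib.request.urlopen("http://localhost:8000/health") as resp:
                    assert resp.status == 200
                    return
            except OSError:
                time.sleep(0.5)
        raise AssertionError("application did not become healthy within 30 seconds")
    finally:
        proc.terminate()
```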
Step 4: Automate the deployment to a non-production environment (Weeks 3-4)
Take the manual deployment steps from Step 1 and encode them in a script or pipeline stage that
deploys the tested artifact to a staging or test environment:
- Provision or configure the target environment.
- Deploy the artifact.
- Run a smoke test to verify the deployment succeeded.
The team now has a pipeline that builds, tests, and deploys to a non-production environment on
every commit. Deployments to this environment should happen without any human intervention.
Step 5: Extend the pipeline to production (Weeks 5-6)
Once the team trusts the automated deployment to non-production environments, extend it to
production:
- Add a manual approval gate if the team is not yet comfortable with fully automated production
deployments. This is a temporary step - the goal is to remove it later.
- Use the same deployment script and process for production that you use for non-production. The
only difference should be the target environment and its configuration.
- Add post-deployment verification: health checks, smoke tests, or basic monitoring checks that
confirm the deployment is healthy.
The first automated production deployment will be nerve-wracking. That is normal. Run it alongside
the manual process the first few times: deploy automatically, then verify manually. As confidence
grows, drop the manual verification.
Step 6: Address the objections (Ongoing)
| Objection | Response |
| --- | --- |
| “Our deployments are too complex to automate” | If a human can follow the steps, a script can execute them. Complex deployments benefit the most from automation because they have the most opportunities for human error. |
| “We don’t have time to build a pipeline” | You are already spending time on every manual deployment. A pipeline is an investment that pays back on the second deployment and every deployment after. |
| “Only Dave knows how to deploy” | That is the problem, not a reason to keep the status quo. Building the pipeline captures Dave’s knowledge in code. Dave should lead the pipeline effort because he knows the procedure best. |
| “What if the pipeline deploys something broken?” | The pipeline includes automated tests and can include approval gates. A broken deployment from a pipeline is no worse than a broken deployment from a human - and the pipeline can roll back automatically. |
| “Our infrastructure doesn’t support modern CI/CD tools” | Start with a shell script triggered by a cron job or a webhook. A pipeline does not require Kubernetes or cloud-native infrastructure. It requires automation of the steps you already perform manually. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Manual steps in the deployment process | Should decrease to zero |
| Deployment duration | Should decrease and stabilize as manual steps are automated |
| Release frequency | Should increase as deployment cost drops |
| Deployment failure rate | Should decrease as human error is removed |
| People who can deploy to production | Should increase from one or two to the entire team |
| Lead time | Should decrease as the manual deployment bottleneck is eliminated |
Related Content
1.4.2 - Manual Deployments
The build is automated but deployment is not. Someone must SSH into servers, run scripts, and shepherd each release to production by hand.
Category: Pipeline & Infrastructure | Quality Impact: High
What This Looks Like
The team has a CI server. Code is built and tested automatically on every push. The pipeline
dashboard is green. But between “pipeline passed” and “code running in production,” there is a
person. Someone must log into a deployment tool, click a button, select the right artifact, choose
the right environment, and watch the output scroll by. Or they SSH into servers, pull the artifact,
run migration scripts, restart services, and verify health checks - all by hand.
The team may not even think of this as a problem. The build is automated. The tests run
automatically. Deployment is “just the last step.” But that last step takes 30 minutes to an hour
of focused human attention, can only happen when the right person is available, and fails often
enough that nobody wants to do it on a Friday afternoon.
Deployment has its own rituals. The team announces in Slack that a deploy is starting. Other
developers stop merging. Someone watches the logs. Another person checks the monitoring dashboard.
When it is done, someone posts a confirmation. The whole team holds its breath during the process
and exhales when it works. This ceremony happens every time, whether the release is one commit or
fifty.
Common variations:
- The button-click deploy. The CI/CD tool has a “deploy to production” button, but a human must
click it and then monitor the result. The automation exists but is not trusted to run
unattended. Someone watches every deployment from start to finish.
- The runbook deploy. A document describes the deployment steps in order. The deployer follows
the runbook, executing commands manually at each step. The runbook was written months ago and
has handwritten corrections in the margins. Some steps have been added, others crossed out.
- The SSH-and-pray deploy. The deployer SSHs into each server individually, pulls code or
copies artifacts, runs scripts, and restarts services. The order matters. Missing a server means
a partial deployment. The deployer keeps a mental checklist of which servers are done.
- The release coordinator deploy. One person coordinates the deployment across multiple systems.
They send messages to different teams: “deploy service A now,” “run the database migration,”
“restart the cache.” The deployment is a choreographed multi-person event.
- The after-hours deploy. Deployments happen only outside business hours because the manual
process is risky enough that the team wants minimal user traffic. Deployers work evenings or
weekends. The deployment window is sacred and stressful.
The telltale sign: if the pipeline is green but the team still needs to “do a deploy” as a
separate activity, deployment is manual.
Why This Is a Problem
A manual deployment negates much of the value that an automated build and test pipeline provides.
The pipeline can validate code in minutes, but if the last mile to production requires a human,
the delivery speed is limited by that human’s availability, attention, and reliability.
It reduces quality
Manual deployment introduces a category of defects that have nothing to do with the code. A
deployer who runs migration scripts in the wrong order corrupts data. A deployer who forgets to
update a config file on one of four servers creates inconsistent behavior. A deployer who restarts
services too quickly triggers a cascade of connection errors. These are process defects - bugs
introduced by the deployment method, not the software.
Manual deployments also degrade the quality signal from the pipeline. The pipeline tests a specific
artifact in a specific configuration. If the deployer manually adjusts configuration, selects a
different artifact version, or skips a verification step, the deployed system no longer matches
what the pipeline validated. The pipeline said “this is safe to deploy,” but what actually reached
production is something slightly different.
Automated deployment eliminates process defects by executing the same steps in the same order
every time. The artifact the pipeline tested is the artifact that reaches production. Configuration
is applied from version-controlled definitions, not from human memory. The deployment is identical
whether it happens at 2 PM on Tuesday or 3 AM on Saturday.
It increases rework
Because manual deployments are slow and risky, teams batch changes. Instead of deploying each
commit individually, they accumulate a week or two of changes and deploy them together. When
something breaks in production, the team must determine which of thirty commits caused the problem.
This diagnosis takes hours. The fix takes more hours. If the fix itself requires a deployment, the
team must go through the manual process again.
Failed deployments are especially costly. A manual deployment that leaves the system in a broken
state requires manual recovery. The deployer must diagnose what went wrong, decide whether to roll
forward or roll back, and execute the recovery steps by hand. If the deployment was a multi-server
process and some servers are on the new version while others are on the old version, the recovery
is even harder. The team may spend more time recovering from a failed deployment than they spent
on the deployment itself.
With automated deployments, each commit deploys individually. When something breaks, the cause is
obvious - it is the one commit that just deployed. Rollback is a single action, not a manual
recovery effort. The time from “something is wrong” to “the previous version is running” is
minutes, not hours.
It makes delivery timelines unpredictable
The gap between “pipeline is green” and “code is in production” is measured in human availability.
If the deployer is in a meeting, the deployment waits. If the deployer is on vacation, the
deployment waits longer. If the deployment fails and the deployer needs help, the recovery depends
on who else is around.
This human dependency makes release timing unpredictable. The team cannot promise “this fix will be
in production in 30 minutes” because the deployment requires a person who may not be available for
hours. Urgent fixes wait for deployment windows. Critical patches wait for the release coordinator
to finish lunch.
The batching effect adds another layer of unpredictability. When teams batch changes to reduce
deployment frequency, each deployment becomes larger and riskier. Larger deployments take longer to
verify and are more likely to fail. The team cannot predict how long the deployment will take
because they cannot predict what will go wrong with a batch of thirty changes.
Automated deployment makes the time from “pipeline green” to “running in production” fixed and
predictable. It takes the same number of minutes regardless of who is available, what day it is,
or how many other things are happening. The team can promise delivery timelines because the
deployment is a deterministic process, not a human activity.
It prevents fast recovery
When production breaks, speed of recovery determines the blast radius. A team that can deploy a
fix in five minutes limits the damage. A team that needs 45 minutes of manual deployment work
exposes users to the problem for 45 minutes plus diagnosis time.
Manual rollback is even worse. Many teams with manual deployments have no practiced rollback
procedure at all. “Rollback” means “re-deploy the previous version,” which means running the
entire manual deployment process again with a different artifact. If the deployment process takes
an hour, rollback takes an hour. If the deployment process requires a specific person, rollback
requires that same person.
Some manual deployments cannot be cleanly rolled back. Database migrations that ran during the
deployment may not have reverse scripts. Config changes applied to servers may not have been
tracked. The team is left doing a forward fix under pressure, manually deploying a patch through
the same slow process that caused the problem.
Automated pipelines with automated rollback can revert to the previous version in minutes. The
rollback follows the same tested path as the deployment. No human judgment is required. The team’s
mean time to repair drops from hours to minutes.
Impact on continuous delivery
Continuous delivery means any commit that passes the pipeline can be released to production at any
time with confidence. Manual deployment breaks this definition at “at any time.” The commit can
only be released when a human is available to perform the deployment, when the deployment window
is open, and when the team is ready to dedicate attention to watching the process.
The manual deployment step is the bottleneck that limits everything upstream. The pipeline can
validate commits in 10 minutes, but if deployment takes an hour of human effort, the team will
never deploy more than a few times per day at best. In practice, teams with manual deployments
release weekly or biweekly because the deployment overhead makes anything more frequent
impractical.
The pipeline is only half the delivery system. Automating the build and tests without automating
the deployment is like paving a highway that ends in a dirt road. The speed of the paved section
is irrelevant if every journey ends with a slow, bumpy last mile.
How to Fix It
Step 1: Script the current manual process (Week 1)
Take the runbook, the checklist, or the knowledge in the deployer’s head and turn it into a
script. Do not redesign the process yet - just encode what the team already does.
- Record a deployment from start to finish. Note every command, every server, every check.
- Write a script that executes those steps in order.
- Store the script in version control alongside the application code.
The script will be rough. It will have hardcoded values and assumptions. That is fine. The goal
is to make the deployment reproducible by any team member, not to make it perfect.
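The first version can be a direct transcription of the runbook. A rough sketch - hostnames, paths,
and service names are placeholders for whatever your checklist actually says:

```python
# deploy.py - the runbook, encoded. Every value below is a placeholder.
import subprocess
import sys

HOSTS = ["app1.example.internal", "app2.example.internal"]
ARTIFACT = sys.argv[1]               # e.g. artifacts/app-a1b2c3.tar.gz
REMOTE_PATH = "/opt/app/releases/"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)  # stop on the first failed step

for host in HOSTS:
    run(["scp", ARTIFACT, f"{host}:{REMOTE_PATH}"])
    run(["ssh", host, "sudo systemctl restart app.service"])
    run(["ssh", host, "curl --fail --silent http://localhost:8000/health"])
print("Deployment complete.")
```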
Step 2: Run the script from the pipeline (Week 2)
Connect the deployment script to the CI/CD pipeline so it runs automatically after the build and
tests pass. Start with a non-production environment:
- Add a deployment stage to the pipeline that targets a staging or test environment.
- Trigger it automatically on every successful build.
- Add a smoke test after deployment to verify it worked.
The team now gets automatic deployments to a non-production environment on every commit. This
builds confidence in the automation and surfaces problems early.
Step 3: Externalize configuration and secrets (Weeks 2-3)
Manual deployments often involve editing config files on servers or passing environment-specific
values by hand. Move these out of the manual process:
- Store environment-specific configuration in a config management system or environment variables
managed by the pipeline.
- Move secrets to a secrets manager (Vault, AWS Secrets Manager, Azure Key Vault, or even
encrypted pipeline variables as a starting point).
- Ensure the deployment script reads configuration from these sources rather than from hardcoded
values or manual input.
This step is critical because manual configuration is one of the most common sources of deployment
failures. Automating deployment without automating configuration just moves the manual step.
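On the application side the change is usually small: read environment-specific values from
variables the pipeline injects rather than from files edited by hand. A sketch with hypothetical
variable names:

```python
# config.py - environment-specific values come from the pipeline, not hand edits.
import os

def require(name: str) -> str:
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value

DATABASE_URL = require("DATABASE_URL")        # injected by the pipeline
PAYMENT_API_KEY = require("PAYMENT_API_KEY")  # injected from the secrets manager
LOG_LEVEL = os.environ.get("LOG_LEVEL", "info")
```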
Step 4: Automate production deployment with a gate (Weeks 3-4)
Extend the pipeline to deploy to production using the same script and process:
- Add a production deployment stage after the non-production deployment succeeds.
- Include a manual approval gate - a button that a team member clicks to authorize the production
deployment. This is a temporary safety net while the team builds confidence.
- Add post-deployment health checks that automatically verify the deployment succeeded.
- Add automated rollback that triggers if the health checks fail.
The approval gate means a human still decides when to deploy, but the deployment itself is fully
automated. No SSHing. No manual steps. No watching logs scroll by.
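The health check and rollback can start as one more pipeline step. A sketch - the health URL and
the rollback command are placeholders for your own:

```python
# verify_release.py - run by the pipeline after the production deploy.
# If the health check does not go green in time, roll back automatically.
import subprocess
import time
import urllib.request

HEALTH_URL = "https://app.example.com/health"                        # placeholder
ROLLBACK_CMD = ["python", "deploy.py", "artifacts/previous.tar.gz"]  # placeholder

def healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

deadline = time.monotonic() + 120
while time.monotonic() < deadline:
    if healthy():
        print("Deployment verified healthy.")
        raise SystemExit(0)
    time.sleep(5)

print("Health check failed - rolling back to the previous artifact.")
subprocess.run(ROLLBACK_CMD, check=True)
raise SystemExit(1)
```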
Step 5: Remove the manual gate (Weeks 6-8)
Once the team has seen the automated production deployment succeed repeatedly, remove the manual
approval gate. The pipeline now deploys to production automatically when all checks pass.
This is the hardest step emotionally. The team will resist. Expect these objections:
| Objection | Response |
| --- | --- |
| “We need a human to decide when to deploy” | Why? If the pipeline validates the code and the deployment process is automated and tested, what decision is the human making? If the answer is “checking that nothing looks weird,” that check should be automated. |
| “What if it deploys during peak traffic?” | Use deployment windows in the pipeline configuration, or use progressive rollout strategies that limit blast radius regardless of traffic. |
| “We had a bad deployment last month” | Was it caused by the automation or by a gap in testing? If the tests missed a defect, the fix is better tests, not a manual gate. If the deployment process itself failed, the fix is better deployment automation, not a human watching. |
| “Compliance requires manual approval” | Review the actual compliance requirement. Most require evidence of approval, not a human clicking a button at deployment time. A code review approval, an automated policy check, or an audit log of the pipeline run often satisfies the requirement. |
| “Our deployments require coordination with other teams” | Automate the coordination. Use API contracts, deployment dependencies in the pipeline, or event-based triggers. If another team must deploy first, encode that dependency rather than coordinating in Slack. |
Step 6: Add deployment observability (Ongoing)
Once deployments are automated, invest in knowing whether they worked:
- Monitor error rates, latency, and key business metrics after every deployment.
- Set up automatic rollback triggers tied to these metrics.
- Track deployment frequency, duration, and failure rate over time.
The team should be able to deploy without watching. The monitoring watches for them.
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Manual steps per deployment | Should reach zero |
| Deployment duration (human time) | Should drop from hours to zero - the pipeline does the work |
| Release frequency | Should increase as deployment friction drops |
| Change fail rate | Should decrease as manual process defects are eliminated |
| Mean time to repair | Should decrease as rollback becomes automated |
| Lead time | Should decrease as the deployment bottleneck is removed |
Related Content
1.4.3 - Snowflake Environments
Each environment is hand-configured and unique. Nobody knows exactly what is running where. Configuration drift is constant.
Category: Pipeline & Infrastructure | Quality Impact: High
What This Looks Like
Staging has a different version of the database than production. The dev environment has a library
installed that nobody remembers adding. Production has a configuration file that was edited by hand
six months ago during an incident and never committed to source control. Nobody is sure all three
environments are running the same OS patch level.
A developer asks “why does this work in staging but not in production?” The answer takes hours to
find because it requires comparing configurations across environments by hand - diffing config
files, checking installed packages, verifying environment variables one by one.
Common variations:
- The hand-built server. Someone provisioned the production server two years ago. They followed
a wiki page that has since been edited, moved, or deleted. Nobody has provisioned a new one
since. If the server dies, nobody is confident they can recreate it.
- The magic SSH session. During an incident, someone SSH-ed into production and changed a
config value. It fixed the problem. Nobody updated the deployment scripts, the infrastructure
code, or the documentation. The next deployment overwrites the fix - or doesn’t, depending on
which files the deployment touches.
- The shared dev environment. A single development or staging environment is shared by the
whole team. One developer installs a library, another changes a config value, a third adds a
cron job. The environment drifts from any known baseline within weeks.
- The “production is special” mindset. Dev and staging environments are provisioned with
scripts, but production was set up differently because of “security requirements” or “scale
differences.” The result is that the environments the team tests against are structurally
different from the one that serves users.
- The environment with a name. Environments have names like “staging-v2” or “qa-new” because
someone created a new one alongside the old one. Both still exist. Nobody is sure which one the
pipeline deploys to.
The telltale sign: deploying the same artifact to two environments produces different results,
and the team’s first instinct is to check environment configuration rather than application code.
Why This Is a Problem
Snowflake environments undermine the fundamental premise of testing: that the behavior you observe
in one environment predicts the behavior you will see in another. When every environment is
unique, testing in staging tells you what works in staging - nothing more.
It reduces quality
When environments differ, bugs hide in the gaps. An application that works in staging may fail in
production because of a different library version, a missing environment variable, or a filesystem
permission that was set by hand. These bugs are invisible to testing because the test environment
does not reproduce the conditions that trigger them.
The team learns this the hard way, one production incident at a time. Each incident teaches the
team that “passed in staging” does not mean “will work in production.” This erodes trust in the
entire testing and deployment process. Developers start adding manual verification steps -
checking production configs by hand before deploying, running smoke tests manually after
deployment, asking the ops team to “keep an eye on things.”
When environments are identical and provisioned from the same code, the gap between staging and
production disappears. What works in staging works in production because the environments are the
same. Testing produces reliable results.
It increases rework
Snowflake environments cause two categories of rework. First, developers spend hours debugging
environment-specific issues that have nothing to do with application code. “Why does this work on
my machine but not in CI?” leads to comparing configurations, googling error messages related to
version mismatches, and patching environments by hand. This time is pure waste.
Second, production incidents caused by environment drift require investigation, rollback, and
fixes to both the application and the environment. A configuration difference that causes a
production failure might take five minutes to fix once identified, but identifying it takes hours
because nobody knows what the correct configuration should be.
Teams with reproducible environments spend zero time on environment debugging. If an environment
is wrong, they destroy it and recreate it from code. The investigation time drops from hours to
minutes.
It makes delivery timelines unpredictable
Deploying to a snowflake environment is unpredictable because the environment itself is an
unknown variable. The same deployment might succeed on Monday and fail on Friday because someone
changed something in the environment between the two deploys. The team cannot predict how long a
deployment will take because they cannot predict what environment issues they will encounter.
This unpredictability compounds across environments. A change must pass through dev, staging, and
production, and each environment is a unique snowflake with its own potential for surprise. A
deployment that should take minutes takes hours because each environment reveals a new
configuration issue.
Reproducible environments make deployment time a constant. The same artifact deployed to the same
environment specification produces the same result every time. Deployment becomes a predictable
step in the pipeline rather than an adventure.
It makes environments a scarce resource
When environments are hand-configured, creating a new one is expensive. It takes hours or days of
manual work. The team has a small number of shared environments and must coordinate access. “Can
I use staging today?” becomes a daily question. Teams queue up for access to the one environment
that resembles production.
This scarcity blocks parallel work. Two developers who both need to test a database migration
cannot do so simultaneously if there is only one staging environment. One waits while the other
finishes. Features that could be validated in parallel are serialized through a shared
environment bottleneck.
When environments are defined as code, spinning up a new one is a pipeline step that takes
minutes. Each developer or feature branch can have its own environment. There is no contention
because environments are disposable and cheap.
Impact on continuous delivery
Continuous delivery requires that any change can move from commit to production through a fully
automated pipeline. Snowflake environments break this in multiple ways. The pipeline cannot
provision environments automatically if environments are hand-configured. Testing results are
unreliable because environments differ. Deployments fail unpredictably because of configuration
drift.
A team with snowflake environments cannot trust their pipeline. They cannot deploy frequently
because each deployment risks hitting an environment-specific issue. They cannot automate
fully because the environments require manual intervention. The path from commit to production
is neither continuous nor reliable.
How to Fix It
Step 1: Document what exists today (Week 1)
Before automating anything, capture the current state of each environment:
- For each environment (dev, staging, production), record: OS version, installed packages,
configuration files, environment variables, external service connections, and any manual
customizations.
- Diff the environments against each other. Note every difference.
- Classify each difference as intentional (e.g., production uses a larger instance size) or
accidental (e.g., staging has an old library version nobody updated).
This audit surfaces the drift. Most teams are surprised by how many accidental differences exist.
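The diff step is easy to automate once each environment's state is captured as text. A sketch
that compares two captured package lists - the capture format is an assumption:

```python
# diff_envs.py - compare captured package lists from two environments.
# Usage: python diff_envs.py staging-packages.txt production-packages.txt
import sys

def load(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

staging, production = load(sys.argv[1]), load(sys.argv[2])

for pkg in sorted(staging - production):
    print(f"only in staging:    {pkg}")
for pkg in sorted(production - staging):
    print(f"only in production: {pkg}")
```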
Step 2: Define one environment specification (Weeks 2-3)
Choose an infrastructure-as-code tool (Terraform, Pulumi, CloudFormation, Ansible, or similar)
and write a specification for one environment. Start with the environment you understand best -
usually staging.
The specification should define:
- Base infrastructure (servers, containers, networking)
- Installed packages and their versions
- Configuration files and their contents
- Environment variables with placeholder values
- Any scripts that run at provisioning time
Verify the specification by destroying the staging environment and recreating it from code. If
the recreated environment works, the specification is correct. If it does not, fix the
specification until it does.
Step 3: Parameterize for environment differences (Week 3)
Intentional differences between environments (instance sizes, database connection strings, API
keys) become parameters, not separate specifications. One specification with environment-specific
variables:
| Parameter | Dev | Staging | Production |
| --- | --- | --- | --- |
| Instance size | small | medium | large |
| Database host | dev-db.internal | staging-db.internal | prod-db.internal |
| Log level | debug | info | warn |
| Replica count | 1 | 2 | 3 |
The structure is identical. Only the values change. This eliminates accidental drift because every
environment is built from the same template.
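Whatever tool you choose, the shape is the same: one template, one small set of values per
environment. A tool-agnostic sketch of the idea (the parameter names mirror the table above; in a
real setup the rendering belongs to your infrastructure-as-code tool):

```python
# One template, three parameter sets - accidental drift has nowhere to hide.
TEMPLATE_KEYS = {"instance_size", "database_host", "log_level", "replica_count"}

PARAMETERS = {
    "dev":        {"instance_size": "small",  "database_host": "dev-db.internal",     "log_level": "debug", "replica_count": 1},
    "staging":    {"instance_size": "medium", "database_host": "staging-db.internal", "log_level": "info",  "replica_count": 2},
    "production": {"instance_size": "large",  "database_host": "prod-db.internal",    "log_level": "warn",  "replica_count": 3},
}

def render(environment: str) -> dict:
    values = PARAMETERS[environment]
    missing = TEMPLATE_KEYS - values.keys()
    if missing:
        raise ValueError(f"{environment} is missing parameters: {sorted(missing)}")
    return values

print(render("staging"))
```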
Step 4: Provision environments through the pipeline (Week 4)
Add environment provisioning to the deployment pipeline:
- Before deploying to an environment, the pipeline provisions (or updates) it from the
infrastructure code.
- The application artifact is deployed to the freshly provisioned environment.
- If provisioning or deployment fails, the pipeline fails - no manual intervention.
This closes the loop. Environments cannot drift because they are recreated or reconciled on
every deployment. Manual SSH sessions and hand edits have no lasting effect because the next
pipeline run overwrites them.
Step 5: Make environments disposable (Week 5+)
The ultimate goal is that any environment can be destroyed and recreated in minutes with no data
loss and no human intervention:
- Practice destroying and recreating staging weekly. This verifies the specification stays
accurate and builds team confidence.
- Provision ephemeral environments for feature branches or pull requests. Let the pipeline
create and destroy them automatically.
- If recreating production is not feasible yet (stateful systems, licensing), ensure you can
provision a production-identical environment for testing at any time.
| Objection | Response |
| --- | --- |
| “Production has unique requirements we can’t codify” | If a requirement exists only in production and is not captured in code, it is at risk of being lost. Codify it. If it is truly unique, it belongs in a parameter, not a hand-edit. |
| “We don’t have time to learn infrastructure-as-code” | You are already spending that time debugging environment drift. The investment pays for itself within weeks. Start with the simplest tool that works for your platform. |
| “Our environments are managed by another team” | Work with them. Provide the specification. If they provision from your code, you both benefit: they have a reproducible process and you have predictable environments. |
| “Containers solve this problem” | Containers solve application-level consistency. You still need infrastructure-as-code for the platform the containers run on - networking, storage, secrets, load balancers. Containers are part of the solution, not the whole solution. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Environment provisioning time | Should decrease from hours/days to minutes |
| Configuration differences between environments | Should reach zero accidental differences |
| “Works in staging but not production” incidents | Should drop to near zero |
| Change fail rate | Should decrease as environment parity improves |
| Mean time to repair | Should decrease as environments become reproducible |
| Time spent debugging environment issues | Track informally - should approach zero |
Related Content
1.5 - Organizational and Cultural
Anti-patterns in team culture, management practices, and organizational structure that block continuous delivery.
These anti-patterns affect the human and organizational side of delivery. They create
misaligned incentives, erode trust, and block the cultural changes that continuous delivery
requires. Technical practices alone cannot overcome a culture that works against them.
1.5.1 - Change Advisory Board Gates
Manual committee approval required for every production change. Meetings are weekly. One-line fixes wait alongside major migrations.
Category: Organizational & Cultural | Quality Impact: High
What This Looks Like
Before any change can reach production, it must be submitted to the Change Advisory Board. The
developer fills out a change request form: description of the change, impact assessment, rollback
plan, testing evidence, and approval signatures. The form goes into a queue. The CAB meets once
a week - sometimes every two weeks - to review the queue. Each change gets a few minutes of
discussion. The board approves, rejects, or requests more information.
A one-line configuration fix that a developer finished on Monday waits until Thursday’s CAB
meeting. If the board asks a question, the change waits until the next meeting. A two-line bug
fix sits in the same queue as a database migration, reviewed by the same people with the same
ceremony.
Common variations:
- The rubber-stamp CAB. The board approves everything. Nobody reads the change requests
carefully because the volume is too high and the context is too shallow. The meeting exists
to satisfy an audit requirement, not to catch problems. It adds delay without adding safety.
- The bottleneck approver. One person on the CAB must approve every change. That person is
in six other meetings, has 40 pending reviews, and is on vacation next week. Deployments
stop when they are unavailable.
- The emergency change process. Urgent fixes bypass the CAB through an “emergency change”
procedure that requires director-level approval and a post-hoc review. The emergency process
is faster, so teams learn to label everything urgent. The CAB process is for scheduled changes,
and fewer changes are scheduled.
- The change freeze. Certain periods - end of quarter, major events, holidays - are declared
change-free zones. No production changes for days or weeks. Changes pile up during the freeze
and deploy in a large batch afterward, which is exactly the high-risk event the freeze was
meant to prevent.
- The form-driven process. The change request template has 15 fields, most of which are
irrelevant for small changes. Developers spend more time filling out the form than making the
change. Some fields require information the developer does not have, so they make something up.
The telltale sign: a developer finishes a change and says “now I need to submit it to the CAB”
with the same tone they would use for “now I need to go to the dentist.”
Why This Is a Problem
CAB gates exist to reduce risk. In practice, they increase risk by creating delay, encouraging
batching, and providing a false sense of security. The review is too shallow to catch real
problems and too slow to enable fast delivery.
It reduces quality
A CAB review is a review by people who did not write the code, did not test it, and often do not
understand the system it affects. A board member scanning a change request form for five minutes
cannot assess the quality of a code change. They can check that the form is filled out. They
cannot check that the change is safe.
The real quality checks - automated tests, code review by peers, deployment verification - happen
before the CAB sees the change. The CAB adds nothing to quality because it reviews paperwork, not
code. The developer who wrote the tests and the reviewer who read the diff know far more about
the change’s risk than a board member reading a summary.
Meanwhile, the delay the CAB introduces actively harms quality. A bug fix that is ready on Monday
but cannot deploy until Thursday means users experience the bug for three extra days. A security
patch that waits for weekly approval is a vulnerability window measured in days.
Teams without CAB gates deploy quality checks into the pipeline itself: automated tests, security
scans, peer review, and deployment verification. These checks are faster, more thorough, and
more reliable than a weekly committee meeting.
It increases rework
The CAB process generates significant administrative overhead. For every change, a developer must
write a change request, gather approval signatures, and attend (or wait for) the board meeting.
This overhead is the same whether the change is a one-line typo fix or a major feature.
When the CAB requests more information or rejects a change, the cycle restarts. The developer
updates the form, resubmits, and waits for the next meeting. A change that was ready to deploy
a week ago sits in a review loop while the developer has moved on to other work. Picking it back
up costs context-switching time.
The batching effect creates its own rework. When changes are delayed by the CAB process, they
accumulate. Developers merge multiple changes to avoid submitting multiple requests. Larger
batches are harder to review, harder to test, and more likely to cause problems. When a problem
occurs, it is harder to identify which change in the batch caused it.
It makes delivery timelines unpredictable
The CAB introduces a built-in delay into every deployment. If the board meets weekly, the wait
from “change ready” to “change deployed” can stretch to a full week, depending on when the change
was finished relative to the meeting schedule. This delay is independent of the change’s size,
risk, or urgency.
The delay is also variable. A change submitted on Monday might be approved Thursday. A change
submitted on Friday waits until the following Thursday. If the board requests revisions, add
another week. Developers cannot predict when their change will reach production because the
timeline depends on a meeting schedule and a queue they do not control.
This unpredictability makes it impossible to make reliable commitments. When a stakeholder asks
“when will this be live?” the developer must account for development time plus an unpredictable
CAB delay. The answer becomes “sometime in the next one to three weeks” for a change that took
two hours to build.
It creates a false sense of security
The most dangerous effect of the CAB is the belief that it prevents incidents. It does not. The
board reviews paperwork, not running systems. A well-written change request for a dangerous
change will be approved. A poorly written request for a safe change will be questioned. The
correlation between CAB approval and deployment safety is weak at best.
Studies of high-performing delivery organizations consistently show that external change approval
processes do not reduce failure rates. The 2019 Accelerate State of DevOps Report found that
teams with external change approval had higher failure rates than teams using peer review and
automated checks. The CAB provides a feeling of control without the substance.
This false sense of security is harmful because it displaces investment in controls that
actually work. If the organization believes the CAB prevents incidents, there is less pressure
to invest in automated testing, deployment verification, and progressive rollout - the controls
that actually reduce deployment risk.
Impact on continuous delivery
Continuous delivery requires that any change can reach production quickly through an automated
pipeline. A weekly approval meeting is fundamentally incompatible with continuous deployment.
The math is simple. If the CAB meets weekly and reviews 20 changes per meeting, the maximum
deployment frequency is 20 per week. A team practicing CD might deploy 20 times per day - a
hundred or more deployments per week. The CAB process caps throughput at a small fraction of
what the team could otherwise achieve.
More importantly, the CAB process assumes that human review of change requests is a meaningful
quality gate. CD assumes that automated checks - tests, security scans, deployment verification -
are better quality gates because they are faster, more consistent, and more thorough. These are
incompatible philosophies. A team practicing CD replaces the CAB with pipeline-embedded controls
that provide equivalent (or superior) risk management without the delay.
How to Fix It
Eliminating the CAB outright is rarely possible because it exists to satisfy regulatory or
organizational governance requirements. The path forward is to replace the manual ceremony with
automated controls that satisfy the same requirements faster and more reliably.
Step 1: Classify changes by risk (Week 1)
Not all changes carry the same risk. Introduce a risk classification:
| Risk level | Criteria | Example | Approval process |
| --- | --- | --- | --- |
| Standard | Small, well-tested, automated rollback | Config change, minor bug fix, dependency update | Peer review + passing pipeline = auto-approved |
| Normal | Medium scope, well-tested | New feature behind a feature flag, API endpoint addition | Peer review + passing pipeline + team lead sign-off |
| High | Large scope, architectural, or compliance-sensitive | Database migration, authentication change, PCI-scoped change | Peer review + passing pipeline + architecture review |
The goal is to route 80-90% of changes through the standard process, which requires no CAB
involvement at all.
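Classification only works if it is mechanical enough that nobody debates it per change. One way to keep it mechanical is to derive the level from facts the pipeline already knows about the change. A hedged sketch - the fields and thresholds are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass

@dataclass
class Change:
    lines_changed: int
    touches_schema: bool
    touches_auth: bool
    has_automated_rollback: bool

def classify(change: Change) -> str:
    """Map a change to standard / normal / high risk, mirroring the table above."""
    if change.touches_schema or change.touches_auth:
        return "high"        # architectural or compliance-sensitive
    if change.lines_changed <= 100 and change.has_automated_rollback:
        return "standard"    # small, well-tested, reversible
    return "normal"          # medium scope: add a team lead sign-off

print(classify(Change(12, False, False, True)))   # -> standard
print(classify(Change(400, True, False, True)))   # -> high
```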
Step 2: Define pipeline controls that replace CAB review (Weeks 2-3)
For each concern the CAB currently addresses, implement an automated alternative:
| CAB concern | Automated replacement |
| --- | --- |
| “Will this change break something?” | Automated test suite with high coverage, pipeline-gated |
| “Is there a rollback plan?” | Automated rollback built into the deployment pipeline |
| “Has this been tested?” | Test results attached to every change as pipeline evidence |
| “Is this change authorized?” | Peer code review with approval recorded in version control |
| “Do we have an audit trail?” | Pipeline logs capture who changed what, when, with what test results |
Document these controls. They become the evidence that satisfies auditors in place of the CAB
meeting minutes.
Step 3: Pilot auto-approval for standard changes (Week 3)
Pick one team or one service as a pilot. Standard-risk changes from that team bypass the CAB
entirely if they meet the automated criteria:
- Code review approved by at least one peer.
- All pipeline stages passed (build, test, security scan).
- Change classified as standard risk.
- Deployment includes automated health checks and rollback capability.
Track the results: deployment frequency, change fail rate, and incident count. Compare with the
CAB-gated process.
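The auto-approval check itself can be a short script that runs as the final pipeline gate. A minimal sketch of the decision logic only; how you fetch review status and pipeline results is platform-specific, so they are passed in as plain values here:

```python
def auto_approve(risk_level: str,
                 peer_approvals: int,
                 pipeline_passed: bool,
                 has_health_checks: bool,
                 has_rollback: bool) -> bool:
    """Return True if a standard-risk change meets all automated criteria."""
    return (
        risk_level == "standard"
        and peer_approvals >= 1
        and pipeline_passed
        and has_health_checks
        and has_rollback
    )

# A reviewed, green, standard-risk change deploys without a CAB ticket.
assert auto_approve("standard", 1, True, True, True)
# A high-risk change still goes to human review.
assert not auto_approve("high", 2, True, True, True)
```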
Step 4: Present the data and expand (Weeks 4-8)
After a month of pilot data, present the results to the CAB and organizational leadership:
- How many changes were auto-approved?
- What was the change fail rate for auto-approved changes vs. CAB-reviewed changes?
- How much faster did auto-approved changes reach production?
- How many incidents were caused by auto-approved changes?
If the data shows that auto-approved changes are as safe or safer than CAB-reviewed changes
(which is the typical outcome), expand the auto-approval process to more teams and more change
types.
Step 5: Reduce the CAB to high-risk changes only (Week 8+)
With most changes flowing through automated approval, the CAB’s scope shrinks to genuinely
high-risk changes: major architectural shifts, compliance-sensitive changes, and cross-team
infrastructure modifications. These changes are infrequent enough that a review process is not
a bottleneck.
The CAB meeting frequency drops from weekly to as-needed. The board members spend their time on
changes that actually benefit from human review rather than rubber-stamping routine deployments.
| Objection | Response |
| --- | --- |
| “The CAB is required by our compliance framework” | Most compliance frameworks (SOX, PCI, HIPAA) require separation of duties and change control, not a specific meeting. Automated pipeline controls with audit trails satisfy the same requirements. Engage your auditors early to confirm. |
| “Without the CAB, anyone could deploy anything” | The pipeline controls are stricter than the CAB. The CAB reviews a form for five minutes. The pipeline runs thousands of tests, security scans, and verification checks. Auto-approval is not no-approval - it is better approval. |
| “We’ve always done it this way” | The CAB was designed for a world of monthly releases. In that world, reviewing 10 changes per month made sense. In a CD world with 10 changes per day, the same process becomes a bottleneck that adds risk instead of reducing it. |
| “What if an auto-approved change causes an incident?” | What if a CAB-approved change causes an incident? (They do.) The question is not whether incidents happen but how quickly you detect and recover. Automated deployment verification and rollback detect and recover faster than any manual process. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Lead time | Should decrease as CAB delay is removed for standard changes |
| Release frequency | Should increase as deployment is no longer gated on weekly meetings |
| Change fail rate | Should remain stable or decrease - proving auto-approval is safe |
| Percentage of changes auto-approved | Should climb toward 80-90% |
| CAB meeting frequency | Should decrease from weekly to as-needed |
| Time from “ready to deploy” to “deployed” | Should drop from days to hours or minutes |
Related Content
1.5.2 - Pressure to Skip Testing
Management pressures developers to skip or shortcut testing to meet deadlines. The test suite rots sprint by sprint as skipped tests become the norm.
Category: Organizational & Cultural | Quality Impact: High
What This Looks Like
A deadline is approaching. The manager asks the team how things are going. A developer says the
feature is done but the tests still need to be written. The manager says “we’ll come back to the
tests after the release.” The tests are never written. Next sprint, the same thing happens. After
a few months, the team has a codebase with patches of coverage surrounded by growing deserts of
untested code.
Nobody made a deliberate decision to abandon testing. It happened one shortcut at a time, each
one justified by a deadline that felt more urgent than the test suite.
Common variations:
- “Tests are a nice-to-have.” The team treats test writing as optional scope that gets cut
when time is short. Features are estimated without testing time. Tests are a separate backlog
item that never reaches the top.
- “We’ll add tests in the hardening sprint.” Testing is deferred to a future sprint dedicated
to quality. That sprint gets postponed, shortened, or filled with the next round of urgent
features. The testing debt compounds.
- “Just get it out the door.” A manager or product owner explicitly tells developers to skip
tests for a specific release. The implicit message is that shipping matters and quality does
not. Developers who push back are seen as slow or uncooperative.
- The coverage ratchet in reverse. The team once had 70% test coverage. Each sprint, a few
untested changes slip through. Coverage drops to 60%, then 50%, then 40%. Nobody notices the
trend because each individual drop is small. By the time someone looks at the number, half the
safety net is gone.
- Testing theater. Developers write the minimum tests needed to pass a coverage gate - trivial
assertions, tests that verify getters and setters, tests that do not actually exercise
meaningful behavior. The coverage number looks healthy but the tests catch nothing.
The telltale sign: the team has a backlog of “write tests for X” tickets that are months old and
have never been started, while production incidents keep increasing.
Why This Is a Problem
Skipping tests feels like it saves time in the moment. It does not. It borrows time from the
future at a steep interest rate. The effects are invisible at first and catastrophic later.
It reduces quality
Every untested change is a change that nobody can verify automatically. The first few skipped
tests are low risk - the code is fresh in the developer’s mind and unlikely to break. But as
weeks pass, the untested code is modified by other developers who do not know the original intent.
Without tests to pin the behavior, regressions creep in undetected.
The damage accelerates. When half the codebase is untested, developers cannot tell which changes
are safe and which are risky. They treat every change as potentially dangerous, which slows them
down. Or they treat every change as probably fine, which lets bugs through. Either way, quality
suffers.
Teams that maintain their test suite catch regressions within minutes of introducing them. The
developer who caused the regression fixes it immediately because they are still working on the
relevant code. The cost of the fix is minutes, not days.
It increases rework
Untested code generates rework in two forms. First, bugs that would have been caught by tests
reach production and must be investigated, diagnosed, and fixed under pressure. A bug found by a
test costs minutes to fix. The same bug found in production costs hours - plus the cost of
the incident response, the rollback or hotfix, and the customer impact.
Second, developers working in untested areas of the codebase move slowly because they have no
safety net. They make a change, manually verify it, discover it broke something else, revert,
try again. Work that should take an hour takes a day because every change requires manual
verification.
The rework is invisible in sprint metrics. The team does not track “time spent debugging issues
that tests would have caught.” But it shows up in velocity: the team ships less and less each
sprint even as they work longer hours.
It makes delivery timelines unpredictable
When the test suite is healthy, the time from “code complete” to “deployed” is a known quantity.
The pipeline runs, tests pass, the change ships. When the test suite has been hollowed out by
months of skipped tests, that step becomes unpredictable. Some changes pass cleanly. Others
trigger production incidents that take days to resolve.
The manager who pressured the team to skip tests in order to hit a deadline ends up with less
predictable timelines, not more. Each skipped test is a small increase in the probability that a
future change will cause an unexpected failure. Over months, the cumulative probability climbs
until production incidents become a regular occurrence rather than an exception.
Teams with comprehensive test suites deliver predictably because the automated checks eliminate
the largest source of variance - undetected defects.
It creates a death spiral
The most dangerous aspect of this anti-pattern is that it is self-reinforcing. Skipping tests
leads to more bugs. More bugs lead to more time spent firefighting. More time firefighting means
less time for testing. Less testing means more bugs. The cycle accelerates.
At the same time, the codebase becomes harder to test. Code written without tests in mind tends
to be tightly coupled, dependent on global state, and difficult to isolate. The longer testing is
deferred, the more expensive it becomes to add tests later. The team’s estimate for “catching up
on testing” grows from days to weeks to months, making it even less likely that management will
allocate the time.
Eventually, the team reaches a state where the test suite is so degraded that it provides no
confidence. The team is effectively back to no test automation
but with the added burden of maintaining a broken test infrastructure that nobody trusts.
Impact on continuous delivery
Continuous delivery requires automated quality gates that the team can rely on. A test suite that
has been eroded by months of skipped tests is not a quality gate - it is a gate with widening
holes. Changes pass through it not because they are safe but because the tests that would have
caught the problems were never written.
A team cannot deploy continuously if they cannot verify continuously. When the manager says “skip
the tests, we need to ship,” they are not just deferring quality work. They are dismantling the
infrastructure that makes frequent, safe deployment possible.
How to Fix It
Step 1: Make the cost visible (Week 1)
The pressure to skip tests comes from a belief that testing is overhead rather than investment.
Change that belief with data:
- Count production incidents in the last 90 days. For each one, identify whether an automated
test could have caught it. Calculate the total hours spent on incident response.
- Measure the team’s change fail rate - the percentage of deployments that cause a failure or
require a rollback.
- Track how long manual verification takes per release. Sum the hours across the team.
Present these numbers to the manager applying pressure. Frame it concretely: “We spent 40 hours
on incident response last quarter. Thirty of those hours went to incidents that the tests we
skipped would have caught.”
Step 2: Include testing in every estimate (Week 2)
Stop treating tests as separate work items that can be deferred:
- Agree as a team: no story is “done” until it has automated tests. This is a working agreement,
not a suggestion.
- Include testing time in every estimate. If a feature takes three days to build, the estimate is
three days - including tests. Testing is not additive; it is part of building the feature.
- Stop creating separate “write tests” tickets. Tests are part of the story, not a follow-up
task.
When a manager asks “can we skip the tests to ship faster?” the answer is “the tests are part of
shipping. Skipping them means the feature is not done.”
Step 3: Set a coverage floor and enforce it (Week 3)
Prevent further erosion with an automated guardrail:
- Measure current test coverage. Whatever it is - 30%, 50%, 70% - that is the floor.
- Configure the pipeline to fail if a change reduces coverage below the floor.
- Ratchet the floor up by 1-2 percentage points each month.
The floor makes the cost of skipping tests immediate and visible. A developer who skips tests
will see the pipeline fail. The conversation shifts from “we’ll add tests later” to “the pipeline
won’t let us merge without tests.”
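Many coverage tools can enforce a threshold natively; if yours cannot, a small pipeline script is enough. A sketch, assuming the current coverage percentage has already been extracted from your coverage report:

```python
import sys

COVERAGE_FLOOR = 52.0  # current floor; ratchet this up 1-2 points per month

def enforce_floor(current_coverage: float, floor: float = COVERAGE_FLOOR) -> None:
    """Fail the pipeline if coverage has dropped below the agreed floor."""
    if current_coverage < floor:
        print(f"Coverage {current_coverage:.1f}% is below the floor of {floor:.1f}%.")
        print("Add tests for the code you changed before merging.")
        sys.exit(1)
    print(f"Coverage {current_coverage:.1f}% meets the floor of {floor:.1f}%.")

if __name__ == "__main__":
    # The number comes from your coverage tool's report; reading that report
    # is tool-specific, so it is passed on the command line in this sketch.
    enforce_floor(float(sys.argv[1]))
```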
Step 4: Recover coverage in high-risk areas (Weeks 3-6)
You cannot test everything retroactively. Prioritize the areas that matter most:
- Use version control history to find the files with the most changes and the most bug fixes.
These are the highest-risk areas (see the sketch after this list).
- For each high-risk file, write tests for the core behavior - the functions that other code
depends on.
- Allocate a fixed percentage of each sprint (e.g., 20%) to writing tests for existing code.
This is not optional and not deferrable.
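A minimal sketch of that version-control analysis, using nothing but the git command line - counting how often each file appears in commits whose messages mention a fix is a rough but useful proxy for risk:

```python
import subprocess
from collections import Counter

def risky_files(since: str = "6 months ago", top: int = 20) -> list[tuple[str, int]]:
    """Rank files by how often they appear in bug-fix commits."""
    log = subprocess.run(
        ["git", "log", f"--since={since}", "--grep=fix", "-i",
         "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    ).stdout
    counts = Counter(line.strip() for line in log.splitlines() if line.strip())
    return counts.most_common(top)

if __name__ == "__main__":
    for path, fixes in risky_files():
        print(f"{fixes:4d}  {path}")
```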
Step 5: Address the management pressure directly (Ongoing)
The root cause is a manager who sees testing as optional. This requires a direct conversation:
| What the manager says | What to say back |
| --- | --- |
| “We don’t have time for tests” | “We don’t have time for the production incidents that skipping tests causes. Last quarter, incidents cost us X hours.” |
| “Just this once, we’ll catch up later” | “We said that three sprints ago. Coverage has dropped from 60% to 45%. There is no ‘later’ unless we stop the bleeding now.” |
| “The customer needs this feature by Friday” | “The customer also needs the application to work. Shipping an untested feature on Friday and a hotfix on Monday does not save time.” |
| “Other teams ship without this many tests” | “Other teams with similar practices have a change fail rate of X%. Ours is Y%. The tests are why.” |
If the manager continues to apply pressure after seeing the data, escalate. Test suite erosion is
a technical risk that affects the entire organization’s ability to deliver. It is appropriate to
raise it with engineering leadership.
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Test coverage trend | Should stop declining and begin climbing |
| Change fail rate | Should decrease as coverage recovers |
| Production incidents from untested code | Track root causes - “no test coverage” should become less frequent |
| Stories completed without tests | Should drop to zero |
| Development cycle time | Should stabilize as manual verification decreases |
| Sprint capacity spent on incident response | Should decrease as fewer untested changes reach production |
Related Content
1.6 - Monitoring and Observability
Anti-patterns in monitoring, alerting, and observability that block continuous delivery.
These anti-patterns affect the team’s ability to see what is happening in production. They
create blind spots that make deployment risky, incident response slow, and confidence in
the delivery pipeline impossible to build.
1.6.1 - No Observability
The team cannot tell if a deployment is healthy. No metrics, no log aggregation, no tracing. Issues are discovered when customers call support.
Category: Monitoring & Observability | Quality Impact: High
What This Looks Like
The team deploys a change. Someone asks “is it working?” Nobody knows. There is no dashboard to
check. There are no metrics to compare before and after. The team waits. If nobody complains
within an hour, they assume the deployment was successful.
When something does go wrong, the team finds out from a customer support ticket, a Slack message
from another team, or an executive asking why the site is slow. The investigation starts with
SSH-ing into a server and reading raw log files. Hours pass before anyone understands what
happened, what caused it, or how many users were affected.
Common variations:
- Logs exist but are not aggregated. Each server writes its own log files. Debugging requires
logging into multiple servers and running grep. Correlating a request across services means
opening terminals to five machines and searching by timestamp.
- Metrics exist but nobody watches them. A monitoring tool was set up once. It has default
dashboards for CPU and memory. Nobody configured application-level metrics. The dashboards show
that servers are running, not whether the application is working.
- Alerting is all or nothing. Either there are no alerts, or there are hundreds of noisy
alerts that the team ignores. Real problems are indistinguishable from false alarms. The
on-call person mutes their phone.
- Observability is someone else’s job. A separate operations or platform team owns the
monitoring tools. The development team does not have access, does not know what is monitored,
and does not add instrumentation to their code.
- Post-deployment verification is manual. After every deployment, someone clicks through the
application to check if it works. This takes 15 minutes per deployment. It catches obvious
failures but misses performance degradation, error rate increases, and partial outages.
The telltale sign: the team’s primary method for detecting production problems is waiting for
someone outside the team to report them.
Why This Is a Problem
Without observability, the team is deploying into a void. They cannot verify that deployments
are healthy, cannot detect problems quickly, and cannot diagnose issues when they arise. Every
deployment is a bet that nothing will go wrong, with no way to check.
It reduces quality
When the team cannot see the effects of their changes in production, they cannot learn from them.
A deployment that degrades response times by 200 milliseconds goes unnoticed. A change that
causes a 2% increase in error rates is invisible. These small quality regressions accumulate
because nobody can see them.
Without production telemetry, the team also loses the most valuable feedback loop: how the
software actually behaves under real load with real data. A test suite can verify logic, but only
production observability reveals performance characteristics, usage patterns, and failure modes
that tests cannot simulate.
Teams with strong observability catch regressions within minutes of deployment. They see error
rate spikes, latency increases, and anomalous behavior in real time. They roll back or fix the
issue before most users are affected. Quality improves because the feedback loop from deployment
to detection is minutes, not days.
It increases rework
Without observability, incidents take longer to detect, longer to diagnose, and longer to resolve.
Each phase of the incident lifecycle is extended because the team is working blind.
Detection takes hours or days instead of minutes because the team relies on external reports.
Diagnosis takes hours instead of minutes because there are no traces, no correlated logs, and no
metrics to narrow the search. The team resorts to reading code and guessing. Resolution takes
longer because without metrics, the team cannot verify that their fix actually worked - they
deploy the fix and wait to see if the complaints stop.
A team with observability detects problems in minutes through automated alerts, diagnoses them
in minutes by following traces and examining metrics, and verifies fixes instantly by watching
dashboards. The total incident lifecycle drops from hours to minutes.
It makes delivery timelines unpredictable
Without observability, the team cannot assess deployment risk. They do not know the current error
rate, the baseline response time, or the system’s capacity. Every deployment might trigger an
incident that consumes the rest of the day, or it might go smoothly. The team cannot predict
which.
This uncertainty makes the team cautious. They deploy less frequently because each deployment is
a potential fire. They avoid deploying on Fridays, before holidays, or before important events.
They batch up changes so there are fewer risky deployment moments. Each of these behaviors slows
delivery and increases batch size, which increases risk further.
Teams with observability deploy with confidence because they can verify health immediately. A
deployment that causes a problem is detected and rolled back in minutes. The blast radius is
small because the team catches issues before they spread. This confidence enables frequent
deployment, which keeps batch sizes small, which reduces risk.
Impact on continuous delivery
Continuous delivery requires fast feedback from production. The deploy-and-verify cycle must be
fast enough that the team can deploy many times per day with confidence. Without observability,
there is no verification step - only hope.
Specifically, CD requires:
- Automated deployment verification. After every deployment, the pipeline must verify that the
new version is healthy before routing traffic to it. This requires health checks, metric
comparisons, and automated rollback triggers - all of which require observability.
- Fast incident detection. If a deployment causes a problem, the team must know within
minutes, not hours. Automated alerts based on error rates, latency, and business metrics
are essential.
- Confident rollback decisions. When a deployment looks unhealthy, the team must be able to
compare current metrics to the baseline and make a data-driven rollback decision. Without
metrics, rollback decisions are based on gut feeling and anecdote.
A team without observability can automate deployment, but they cannot automate verification. That
means every deployment requires manual checking, which caps deployment frequency at whatever pace
the team can manually verify.
How to Fix It
Step 1: Add structured logging (Week 1)
Structured logging is the foundation of observability. Without it, logs are unreadable at scale.
- Replace unstructured log statements such as `log("processing order")` with structured ones such
as `log(event="order.processed", order_id=123, duration_ms=45)`.
- Include a correlation ID in every log entry so that all log entries for a single request can
be linked together across services.
- Send logs to a central aggregation service (Elasticsearch, Datadog, CloudWatch, Loki, or
similar). Stop relying on SSH and grep.
Focus on the most critical code paths first: request handling, error paths, and external service
calls. You do not need to instrument everything in week one.
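A minimal sketch of the difference, using only the Python standard library to emit one JSON object per log line (the field names are illustrative):

```python
import json
import logging
import sys
import time
import uuid

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("app")

def log_event(**fields) -> None:
    """Emit one structured, machine-parseable log line."""
    log.info(json.dumps(fields))

# Unstructured: hard to search, impossible to aggregate.
log.info("processing order")

# Structured: every field is queryable in the log aggregator.
correlation_id = str(uuid.uuid4())  # generated once per request, passed along
start = time.monotonic()
# ... handle the order ...
log_event(
    event="order.processed",
    order_id=123,
    correlation_id=correlation_id,
    duration_ms=round((time.monotonic() - start) * 1000),
)
```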
Step 2: Add application-level metrics (Week 2)
Infrastructure metrics (CPU, memory, disk) tell you the servers are running. Application metrics
tell you the software is working. Add the four golden signals:
| Signal | What to measure | Example |
| --- | --- | --- |
| Latency | How long requests take | p50, p95, p99 response time per endpoint |
| Traffic | How much demand the system handles | Requests per second, messages processed per minute |
| Errors | How often requests fail | Error rate by endpoint, HTTP 5xx count |
| Saturation | How full the system is | Queue depth, connection pool usage, thread count |
Expose these metrics through your application (using Prometheus client libraries, StatsD, or
your platform’s metric SDK) and visualize them on a dashboard.
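As one concrete option, the Prometheus Python client covers latency, traffic, and errors with a few lines of instrumentation. A hedged sketch - the metric names, labels, and endpoint are illustrative:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Latency: a duration histogram gives you p50/p95/p99 per endpoint.
REQUEST_SECONDS = Histogram("request_duration_seconds", "Request latency", ["endpoint"])
# Traffic and errors: a counter sliced by endpoint and status code.
REQUESTS = Counter("requests_total", "Requests handled", ["endpoint", "status"])

def handle_checkout() -> None:
    start = time.monotonic()
    status = "200" if random.random() > 0.02 else "500"  # stand-in for real work
    REQUEST_SECONDS.labels(endpoint="/checkout").observe(time.monotonic() - start)
    REQUESTS.labels(endpoint="/checkout", status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
    while True:
        handle_checkout()
        time.sleep(0.1)
```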
Step 3: Create a deployment health dashboard (Week 3)
Build a single dashboard that answers: “Is the system healthy right now?”
- Include the four golden signals from Step 2.
- Add deployment markers so the team can see when deploys happened and correlate them with
metric changes.
- Include business metrics that matter: successful checkouts per minute, sign-ups per hour,
or whatever your system’s key transactions are.
This dashboard becomes the first thing the team checks after every deployment. It replaces the
manual click-through verification.
Step 4: Add automated alerts for deployment verification (Week 4)
Move from “someone checks the dashboard” to “the system tells us when something is wrong”:
- Set alert thresholds based on your baseline metrics. If the p95 latency is normally 200ms,
alert when it exceeds 500ms for more than 2 minutes.
- Set error rate alerts. If the error rate is normally below 1%, alert when it crosses 5%.
- Connect alerts to the team’s communication channel (Slack, PagerDuty, or similar). Alerts
must reach the people who can act on them.
Start with a small number of high-confidence alerts. Three alerts that fire reliably are worth
more than thirty that the team ignores.
Step 5: Integrate observability into the deployment pipeline (Week 5+)
Close the loop between deployment and verification:
- After deploying, the pipeline waits and checks health metrics automatically. If error rates
spike or latency degrades beyond the threshold, the pipeline triggers an automatic rollback.
- Add smoke tests that run against the live deployment and report results to the dashboard.
- Implement canary deployments or progressive rollouts that route a small percentage of traffic
to the new version and compare its metrics against the baseline before promoting.
This is the point where observability enables continuous delivery. The pipeline can deploy with
confidence because it can verify health automatically.
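The verification step can start as a short script the pipeline runs after the deploy. A minimal sketch of the logic, assuming your metrics backend can answer “what is the error rate and p95 latency right now” - the query function below is a placeholder that returns dummy values:

```python
import sys
import time

ERROR_RATE_THRESHOLD = 0.05   # roll back if more than 5% of requests fail
LATENCY_THRESHOLD_MS = 500    # roll back if p95 latency exceeds this budget
CHECK_WINDOW_SECONDS = 300    # watch the new version for five minutes

def current_metrics() -> tuple[float, float]:
    """Placeholder: replace with a real query to Prometheus, Datadog, CloudWatch, etc."""
    return 0.01, 220.0  # (error rate, p95 latency in ms) - dummy healthy values

def verify_deployment() -> bool:
    """Return True only if the new version stays healthy for the whole window."""
    deadline = time.time() + CHECK_WINDOW_SECONDS
    while time.time() < deadline:
        error_rate, p95_ms = current_metrics()
        if error_rate > ERROR_RATE_THRESHOLD or p95_ms > LATENCY_THRESHOLD_MS:
            return False  # unhealthy: the pipeline triggers the rollback
        time.sleep(30)
    return True

if __name__ == "__main__":
    sys.exit(0 if verify_deployment() else 1)  # non-zero exit fails the pipeline stage
```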
| Objection | Response |
| --- | --- |
| “We don’t have budget for monitoring tools” | Open-source stacks (Prometheus, Grafana, Loki, Jaeger) provide full observability at zero license cost. The investment is setup time, not money. |
| “We don’t have time to add instrumentation” | Start with the deployment health dashboard. One afternoon of work gives the team more production visibility than they have ever had. Build from there. |
| “The ops team handles monitoring” | Observability is a development concern, not just an operations concern. Developers write the code that generates the telemetry. They need access to the dashboards and alerts. |
| “We’ll add observability after we stabilize” | You cannot stabilize what you cannot see. Observability is how you find stability problems. Adding it later means flying blind longer. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Mean time to detect (MTTD) | Time from problem occurring to team being aware - should drop from hours to minutes |
| Mean time to repair | Should decrease as diagnosis becomes faster |
| Manual verification time per deployment | Should drop to zero as automated checks replace manual click-throughs |
| Change fail rate | Should decrease as deployment verification catches problems before they reach users |
| Alert noise ratio | Percentage of alerts that are actionable - should be above 80% |
| Incidents discovered by customers vs. by the team | Ratio should shift toward team detection |
Related Content
1.7 - Architecture
Anti-patterns in system architecture and design that block continuous delivery.
These anti-patterns affect the structure of the software itself. They create coupling that
makes independent deployment impossible, blast radii that make every change risky, and
boundaries that force teams to coordinate instead of delivering independently.
1.7.1 - Tightly Coupled Monolith
Changing one module breaks others. No clear boundaries. Every change is high-risk because blast radius is unpredictable.
Category: Architecture | Quality Impact: High
What This Looks Like
A developer changes a function in the order processing module. The test suite fails in the
reporting module, the notification service, and a batch job that nobody knew existed. The
developer did not touch any of those systems. They changed one function in one file, and three
unrelated features broke.
The team has learned to be cautious. Before making any change, developers trace every caller,
every import, and every database query that might be affected. A change that should take an hour
takes a day because most of the time is spent figuring out what might break. Even after that
analysis, surprises are common.
Common variations:
- The web of shared state. Multiple modules read and write the same database tables directly.
A schema change in one module breaks queries in five others. Nobody owns the tables because
everybody uses them.
- The god object. A single class or module that everything depends on. It handles
authentication, logging, database access, and business logic. Changing it is terrifying because
the entire application runs through it.
- Transitive dependency chains. Module A depends on Module B, which depends on Module C. A
change to Module C breaks Module A through a chain that nobody can trace without a debugger.
The dependency graph is a tangle, not a tree.
- Shared libraries with hidden contracts. Internal libraries used by multiple modules with no
versioning or API stability guarantees. Updating the library for one consumer breaks another.
Teams stop updating shared libraries because the risk is too high.
- Everything deploys together. The application is a single deployable unit. Even if modules
are logically separated in the source code, they compile and ship as one artifact. A one-line
change to the login page requires deploying the entire system.
The telltale sign: developers regularly say “I don’t know what this change will affect” and
mean it. Changes routinely break features that seem unrelated.
Why This Is a Problem
Tight coupling turns every change into a gamble. The cost of a change is not proportional to its
size but to the number of hidden dependencies it touches. Small changes carry large risk, which
slows everything down.
It reduces quality
When every change can break anything, developers cannot reason about the impact of their work.
A well-bounded module lets a developer think locally: “I changed the discount calculation, so
discount-related behavior might be affected.” A tightly coupled system offers no such guarantee.
The discount calculation might share a database table with the shipping module, which triggers
a notification workflow, which updates a dashboard.
This unpredictable blast radius makes code review less effective. Reviewers can verify that the
code in the diff is correct, but they cannot verify that it is safe. The breakage happens in code
that is not in the diff - code that neither the author nor the reviewer thought to check.
In a system with clear module boundaries, the blast radius of a change is bounded by the module’s
interface. If the interface does not change, nothing outside the module can break. Developers and
reviewers can focus on the module itself and trust the boundary.
It increases rework
Tight coupling causes rework in two ways. First, unexpected breakage from seemingly safe changes
sends developers back to fix things they did not intend to touch. A one-line change that breaks
the notification system means the developer now needs to understand and fix the notification
system before their original change can ship.
Second, developers working in different parts of the codebase step on each other. Two developers
changing different modules unknowingly modify the same shared state. Both changes work
individually but conflict when merged. The merge succeeds at the code level but fails at runtime
because the shared state cannot satisfy both changes simultaneously. These bugs are expensive to
find because the failure only manifests when both changes are present.
Systems with clear boundaries minimize this interference. Each module owns its data and exposes
it through explicit interfaces. Two developers working in different modules cannot create a
hidden conflict because there is no shared mutable state to conflict on.
It makes delivery timelines unpredictable
In a coupled system, the time to deliver a change includes the time to understand the impact,
make the change, fix the unexpected breakage, and retest everything that might be affected. The
first and third steps are unpredictable because no one knows the full dependency graph.
A developer estimates a task at two days. On day one, the change is made and tests are passing.
On day two, a failing test in another module reveals a hidden dependency. Fixing the dependency
takes two more days. The task that was estimated at two days takes four. This happens often enough
that the team stops trusting estimates, and stakeholders stop trusting timelines.
The testing cost is also unpredictable. In a modular system, changing Module A means running
Module A’s tests. In a coupled system, changing anything might mean running everything. If the
full test suite takes 30 minutes, every small change requires a 30-minute feedback cycle because
there is no way to scope the impact.
It prevents independent team ownership
When the codebase is a tangle of dependencies, no team can own a module cleanly. Every change in
one team’s area risks breaking another team’s area. Teams develop informal coordination rituals:
“Let us know before you change the order table.” “Don’t touch the shared utils module without
talking to Platform first.”
These coordination costs scale quadratically with the number of teams. Two teams need one
communication channel. Five teams need ten. Ten teams need forty-five. The result is that adding
developers makes the system slower to change, not faster.
In a system with well-defined module boundaries, each team owns their modules and their data.
They deploy independently. They do not need to coordinate on internal changes because the
boundaries prevent cross-module breakage. Communication focuses on interface changes, which are
infrequent and explicit.
Impact on continuous delivery
Continuous delivery requires that any change can flow from commit to production safely and
quickly. Tight coupling breaks this in multiple ways:
- Blast radius prevents small, safe changes. If a one-line change can break unrelated
features, no change is small from a risk perspective. The team compensates by batching changes
and testing extensively, which is the opposite of continuous.
- Testing scope is unbounded. Without module boundaries, there is no way to scope testing to
the changed area. Every change requires running the full suite, which slows the pipeline and
reduces deployment frequency.
- Independent deployment is impossible. If everything must deploy together, deployment
coordination is required. Teams queue up behind each other. Deployment frequency is limited by
the slowest team.
- Rollback is risky. Rolling back one change might break something else if other changes
were deployed simultaneously. The tangle works in both directions.
A team with a tightly coupled monolith can still practice CD, but they must invest in decoupling
first. Without boundaries, the feedback loops are too slow and the blast radius is too large for
continuous deployment to be safe.
How to Fix It
Decoupling a monolith is a long-term effort. The goal is not to rewrite the system or extract
microservices on day one. The goal is to create boundaries that limit blast radius and enable
independent change. Start where the pain is greatest.
Step 1: Map the dependency hotspots (Week 1)
Identify the areas of the codebase where coupling causes the most pain:
- Use version control history to find the files that change together most frequently. Files that
always change as a group are likely coupled.
- List the modules or components that are most often involved in unexpected test failures after
changes to other areas.
- Identify shared database tables - tables that are read or written by more than one module.
- Draw the dependency graph. Tools like dependency-cruiser (JavaScript), jdepend (Java), or
similar can automate this. Look for cycles and high fan-in nodes.
Rank the hotspots by pain: which coupling causes the most unexpected breakage, the most
coordination overhead, or the most test failures?
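A hedged sketch of the co-change analysis from the first bullet above: count how often pairs of files change in the same commit. Pairs with high counts that cross what you consider separate modules are your coupling hotspots (git is the only dependency):

```python
import subprocess
from collections import Counter
from itertools import combinations

def co_change_pairs(since: str = "6 months ago", top: int = 20):
    """Rank file pairs by how often they change in the same commit."""
    log = subprocess.run(
        ["git", "log", f"--since={since}", "--name-only", "--pretty=format:--"],
        capture_output=True, text=True, check=True,
    ).stdout
    pairs: Counter = Counter()
    commit_files: list[str] = []
    for line in log.splitlines():
        if line == "--":                  # marker emitted once per commit
            pairs.update(combinations(sorted(set(commit_files)), 2))
            commit_files = []
        elif line.strip():
            commit_files.append(line.strip())
    pairs.update(combinations(sorted(set(commit_files)), 2))  # last commit
    return pairs.most_common(top)

if __name__ == "__main__":
    for (a, b), count in co_change_pairs():
        print(f"{count:4d}  {a}  <->  {b}")
```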
Step 2: Define module boundaries on paper (Week 2)
Before changing any code, define where boundaries should be:
- Group related functionality into candidate modules based on business domain, not technical
layer. “Orders,” “Payments,” and “Notifications” are better boundaries than “Database,”
“API,” and “UI.”
- For each boundary, define what the public interface would be: what data crosses the boundary
and in what format?
- Identify shared state that would need to be split or accessed through interfaces.
This is a design exercise, not an implementation. The output is a diagram showing target module
boundaries with their interfaces.
Step 3: Enforce one boundary (Weeks 3-6)
Pick the boundary with the best ratio of pain-reduced to effort-required and enforce it in code:
- Create an explicit interface (API, function contract, or event) for cross-module communication.
All external callers must use the interface.
- Move shared database access behind the interface. If the payments module needs order data, it
calls the orders module’s interface rather than querying the orders table directly.
- Add a build-time or lint-time check that enforces the boundary. Fail the build if code outside
the module imports internal code directly.
This is the hardest step because it requires changing existing call sites. Use the Strangler Fig
approach: create the new interface alongside the old coupling, migrate callers one at a time, and
remove the old path when all callers have migrated.
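The build-time boundary check can start as a few lines of static analysis run in the pipeline. A minimal sketch using Python's ast module - the module names and the “internal” convention are assumptions you would replace with your own layout:

```python
import ast
import pathlib
import sys

# Assumption for this sketch: code under orders/internal/ may only be imported
# from within the orders module itself; everyone else must go through orders.api.
PROTECTED_PREFIX = "orders.internal"
ALLOWED_OWNER = "orders"

def boundary_violations(root: str = ".") -> list[str]:
    violations = []
    for path in pathlib.Path(root).rglob("*.py"):
        owner = path.parts[0] if path.parts else ""
        if owner == ALLOWED_OWNER:
            continue  # the module may use its own internals
        tree = ast.parse(path.read_text(), filename=str(path))
        for node in ast.walk(tree):
            names = []
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module]
            for name in names:
                if name.startswith(PROTECTED_PREFIX):
                    violations.append(f"{path}: imports {name}")
    return violations

if __name__ == "__main__":
    problems = boundary_violations()
    for problem in problems:
        print(problem)
    sys.exit(1 if problems else 0)  # any violation fails the build
```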
Step 4: Scope testing to module boundaries (Week 4+)
Once a boundary exists, use it to scope testing:
- Write tests for the module’s public interface (contract tests and functional tests).
- Changes within the module only need to run the module’s own tests plus the interface tests.
If the interface tests pass, nothing outside the module can break.
- Reserve the full integration suite for deployment validation, not developer feedback.
This immediately reduces pipeline duration for changes inside the bounded module. Developers get
faster feedback. The pipeline is no longer “run everything for every change.”
Step 5: Repeat for the next boundary (Ongoing)
Each new boundary reduces blast radius, improves test scoping, and enables more independent
ownership. Prioritize by pain:
| Signal | What it tells you |
| --- | --- |
| Files that always change together across modules | Coupling that forces coordinated changes |
| Unexpected test failures after unrelated changes | Hidden dependencies through shared state |
| Multiple teams needing to coordinate on changes | Ownership boundaries that do not match code boundaries |
| Long pipeline duration from running all tests | No way to scope testing because boundaries do not exist |
Over months, the system evolves from a tangle into a set of modules with defined interfaces. This
is not a rewrite. It is incremental boundary enforcement applied where it matters most.
| Objection | Response |
| --- | --- |
| “We should just rewrite it as microservices” | A rewrite takes months or years and delivers zero value until it is finished. Enforcing boundaries in the existing codebase delivers value with each boundary and does not require a big-bang migration. |
| “We don’t have time to refactor” | You are already paying the cost of coupling in unexpected breakage, slow testing, and coordination overhead. Each boundary you enforce reduces that ongoing cost. |
| “The coupling is too deep to untangle” | Start with the easiest boundary, not the hardest. Even one well-enforced boundary reduces blast radius and proves the approach works. |
| “Module boundaries will slow us down” | Boundaries add a small cost to cross-module changes and remove a large cost from within-module changes. Since most changes are within a module, the net effect is faster delivery. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Unexpected cross-module test failures | Should decrease as boundaries are enforced |
| Change fail rate | Should decrease as blast radius shrinks |
| Build duration | Should decrease as testing can be scoped to affected modules |
| Development cycle time | Should decrease as developers spend less time tracing dependencies |
| Cross-team coordination requests per sprint | Should decrease as module ownership becomes clearer |
| Files changed per commit | Should decrease as changes become more localized |
Related Content
2 - Migrating Brownfield to CD
Already have a running system? A phased approach to migrating existing applications and teams to continuous delivery.
Most teams adopting CD are not starting from scratch. They have existing codebases, existing
processes, existing habits, and existing pain. This section provides the phased migration path
from where you are today to continuous delivery, without stopping feature delivery along the way.
The Reality of Brownfield Migration
Migrating an existing system to CD is harder than building CD into a greenfield project. You are
working against inertia: existing branching strategies, existing test suites (or lack thereof),
existing deployment processes, and existing team habits. Every change has to be made incrementally,
alongside regular delivery work.
The good news: every team that has successfully adopted CD has done it this way. The practices in
this guide are designed for incremental adoption, not big-bang transformation.
The Migration Phases
The migration is organized into five phases. Each phase builds on the previous one. Start with
Phase 0 to understand where you are, then work through the phases in order.
| Phase | Name | Goal | Key Question |
| --- | --- | --- | --- |
| 0 | Assess | Understand where you are | “How far are we from CD?” |
| 1 | Foundations | Daily integration, testing, small work | “Can we integrate safely every day?” |
| 2 | Pipeline | Automated path to production | “Can we deploy any commit automatically?” |
| 3 | Optimize | Improve flow, reduce batch size | “Can we deliver small changes quickly?” |
| 4 | Deliver on Demand | Deploy any change when needed | “Can we deliver any change to production when needed?” |
Where to Start
If you don’t know where you stand
Start with Phase 0 - Assess. Complete the value stream mapping exercise, take
baseline metrics, and fill out the current-state checklist. These activities tell you exactly
where you stand and which phase to begin with.
If you know your biggest pain point
Start with Anti-Patterns. Find the problem your team feels most, and follow the
links to the practices and migration phases that address it.
Quick self-assessment
If you don’t have time for a full assessment, answer these questions:
- Do all developers integrate to trunk at least daily? If no, start with
Phase 1.
- Do you have a single automated pipeline that every change goes through? If no, start with
Phase 2.
- Can you deploy any green build to production on demand? If no, focus on the gap between
your current state and Phase 2 completion criteria.
- Do you deploy at least weekly? If no, look at Phase 3 for batch size and
flow optimization.
Principles for Brownfield Migration
Do not stop delivering features
The migration is done alongside regular delivery work, not instead of it. Each practice is adopted
incrementally. You do not stop the world to rewrite your test suite or redesign your pipeline.
Fix the biggest constraint first
Use your value stream map and metrics to identify which blocker is the current constraint. Fix
that one thing. Then find the next constraint and fix that. Do not try to fix everything at once.
See Identify Constraints and the
CD Dependency Tree.
Make progress visible
Track your DORA metrics from day one: deployment frequency, lead time for changes, change failure
rate, and mean time to restore. These metrics show whether your changes are working and build the
case for continued investment.
See Baseline Metrics.
Start with one team
CD adoption works best when a single team can experiment, learn, and iterate without waiting for
organizational consensus. Once one team demonstrates results, other teams have a concrete example
to follow.
Common Brownfield Challenges
These challenges are specific to migrating existing systems. For the full catalog of problems
teams face, see Anti-Patterns.
| Challenge | Why it’s hard | Approach |
| --- | --- | --- |
| Large codebase with no tests | Writing tests retroactively is expensive and the ROI feels unclear | Do not try to add tests to the whole codebase. Add tests to every file you touch. Use the test-for-every-bug-fix rule. Coverage grows where it matters most. |
| Long-lived feature branches | The team has been using feature branches for years and the workflow feels safe | Reduce branch lifetime gradually: from two weeks to one week to two days to same-day. Do not switch to trunk overnight. |
| Manual deployment process | The “deployment expert” has a 50-step runbook in their head | Document the manual process first. Then automate one step at a time, starting with the most error-prone step. |
| Flaky test suite | Tests that randomly fail have trained the team to ignore failures | Quarantine all flaky tests immediately. They do not block the build until they are fixed. Zero tolerance for new flaky tests. |
| Tightly coupled architecture | Changing one module breaks others unpredictably | You do not need microservices. You need clear boundaries. Start by identifying and enforcing module boundaries within the monolith. |
| Organizational resistance | “We’ve always done it this way” | Start small, show results, build the case with data. One team deploying daily with lower failure rates is more persuasive than any slide deck. |
Migration Timeline
These ranges assume a single team working on the migration alongside regular delivery work:
| Phase | Typical Duration | Biggest Variable |
| --- | --- | --- |
| Phase 0 - Assess | 1-2 weeks | None - just do it |
| Phase 1 - Foundations | 1-6 months | Current testing and trunk-based development maturity |
| Phase 2 - Pipeline | 1-3 months | Complexity of existing deployment process |
| Phase 3 - Optimize | 2-6 months | Organizational willingness to change batch size and approval processes |
| Phase 4 - Deliver on Demand | 1-3 months | Confidence in pipeline and rollback capability |
Do not treat these timelines as commitments. The migration is an iterative improvement process,
not a project with a deadline.
Related Content
2.1 - Document Your Current Process
Before formal value stream mapping, get the team to write down every step from “ready to push” to “running in production.” Quick wins surface immediately; the documented process becomes better input for the value stream mapping session.
The Brownfield CD overview covers the migration phases, principles, and common challenges.
This page covers the first practical step - documenting what actually happens today between a
developer finishing a change and that change running in production.
Why Document Before Mapping
Value stream mapping is a powerful tool for systemic improvement. It requires measurement, cross-team
coordination, and careful analysis. That takes time to do well, and it should not be rushed.
But you do not need a value stream map to spot obvious friction. Manual steps that could be
automated, wait times caused by batching, handoffs that exist only because of process - these
are visible the moment you write the process down.
Document your current process first. This gives you two things:
- Quick wins you can fix this week. Obvious waste that requires no measurement or
cross-team coordination to remove.
- Better input for value stream mapping. When you do the formal mapping session, the team
is not starting from a blank whiteboard. They have a shared, written description of what
actually happens, and they have already removed the most obvious friction.
Quick wins build momentum. Teams that see immediate improvements are more willing to invest in
the deeper systemic work that value stream mapping reveals.
How to Do It
Get the team together. Pick a recent change that went through the full process from “ready to
push” to “running in production.” Walk through every step that happened, in order.
The rules:
- Document what actually happens, not what should happen. If the official process says
“automated deployment” but someone actually SSHes into a server and runs a script, write
down the SSH step.
- Include the invisible steps. The Slack message asking for review. The email requesting
deploy approval. The wait for the Tuesday deploy window. These are often the biggest sources
of delay and they are usually missing from official process documentation.
- Get the whole team in the room. Different people see different parts of the process. The
developer who writes the code may not know what happens after the merge. The ops person who
runs the deploy may not know about the QA handoff. You need every perspective.
- Write it down as an ordered list. Not a flowchart, not a diagram, not a wiki page with
sections. A simple numbered list of steps in the order they actually happen.
What to Capture for Each Step
For every step in the process, capture these details:
| Field | What to Write | Example |
| --- | --- | --- |
| Step name | What happens, in plain language | “QA runs manual regression tests” |
| Who does it | Person or role responsible | “QA engineer on rotation” |
| Manual or automated | Is this step done by a human or by a tool? | “Manual” |
| Typical duration | How long the step itself takes | “4 hours” |
| Wait time before it starts | How long the change sits before this step begins | “1-2 days (waits for QA availability)” |
| What can go wrong | Common failure modes for this step | “Tests find a bug, change goes back to dev” |
The wait time column is usually more revealing than the duration column. A deploy that takes 10
minutes but only happens on Tuesdays has up to 7 days of wait time. The step itself is not the
problem - the batching is.
Example: A Typical Brownfield Process
This is a realistic example of what a brownfield team’s process might look like before any CD
practices are adopted. Your process will differ, but the pattern of manual steps and wait times
is common.
| # | Step | Who | Manual/Auto | Duration | Wait Before | What Can Go Wrong |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Push to feature branch | Developer | Manual | Minutes | None | Merge conflicts with other branches |
| 2 | Open pull request | Developer | Manual | 10 min | None | Forgot to update tests |
| 3 | Wait for code review | Developer (waiting) | Manual | - | 4 hours to 2 days | Reviewer is busy, PR sits |
| 4 | Address review feedback | Developer | Manual | 30 min to 2 hours | - | Multiple rounds of feedback |
| 5 | Merge to main branch | Developer | Manual | Minutes | - | Merge conflicts from stale branch |
| 6 | CI runs (build + unit tests) | CI server | Automated | 15 min | Minutes | Flaky tests cause false failures |
| 7 | QA picks up ticket from board | QA engineer | Manual | - | 1-3 days | QA backlog, other priorities |
| 8 | Manual functional testing | QA engineer | Manual | 2-4 hours | - | Finds bug, sends back to dev |
| 9 | Request deploy approval | Team lead | Manual | 5 min | - | Approver is on vacation |
| 10 | Wait for deploy window | Everyone (waiting) | - | - | 1-7 days (deploys on Tuesdays) | Window missed, wait another week |
| 11 | Ops runs deployment | Ops engineer | Manual | 30 min | - | Script fails, manual rollback |
| 12 | Smoke test in production | Ops engineer | Manual | 15 min | - | Finds issue, emergency rollback |
Total typical time: 3 to 14 days from “ready to push” to “running in production.”
Even before measurement or analysis, patterns jump out:
- Steps 3, 7, and 10 are pure wait time - nothing is happening to the change.
- Steps 8 and 12 are manual testing that could potentially be automated.
- Step 10 is artificial batching - deploys happen on a schedule, not on demand.
- Step 9 might be a rubber-stamp approval that adds delay without adding safety.
Spotting Quick Wins
Once the process is documented, look for these patterns. Each one is a potential quick win that
the team can fix without a formal improvement initiative.
Automation targets
Steps that are purely manual but have well-known automation:
- Code formatting and linting. If reviewers spend time on style issues, add a linter to CI.
This saves reviewer time on every single PR.
- Running tests. If someone manually runs tests before merging, make CI run them
automatically on every push.
- Build and package. If someone manually builds artifacts, automate the build in the
pipeline.
- Smoke tests. If someone manually clicks through the app after deploy, write a small set
of automated smoke tests.
Batching delays
Steps where changes wait for a scheduled event:
- Deploy windows. “We deploy on Tuesdays” means every change waits an average of 3.5 days.
Moving to deploy-on-demand (even if still manual) removes this wait entirely.
- QA batches. “QA tests the release candidate” means changes queue up. Testing each change
as it merges removes the batch.
- CAB meetings. “The change advisory board meets on Thursdays” adds up to a week of wait
time per change.
Process-only handoffs
Steps where work moves between people not because of a skill requirement, but because of
process:
- QA sign-off that is a rubber stamp. If QA always approves and never finds issues, the
sign-off is not adding value.
- Approval steps that are never rejected. Track the rejection rate. If an approval step
has a 0% rejection rate over the last 6 months, it is ceremony, not a gate.
- Handoffs between people who sit next to each other. If the developer could do the step
themselves but “process says” someone else has to, question the process.
Unnecessary steps
Steps that exist because of historical reasons and no longer serve a purpose:
- Manual steps that duplicate automated checks. If CI runs the tests and someone also runs
them manually “just to be sure,” the manual run is waste.
- Approvals for low-risk changes. Not every change needs the same level of scrutiny. A
typo fix in documentation does not need a CAB review.
Quick Wins vs. Value Stream Improvements
Not everything you find in the documented process is a quick win. Distinguish between the two:
|   | Quick Wins | Value Stream Improvements |
| --- | --- | --- |
| Scope | Single team can fix | Requires cross-team coordination |
| Timeline | Days to a week | Weeks to months |
| Measurement | Obvious before/after | Requires baseline metrics and tracking |
| Risk | Low - small, reversible changes | Higher - systemic process changes |
| Examples | Add linter to CI, remove rubber-stamp approval, enable on-demand deploys | Restructure testing strategy, redesign deployment pipeline, change team topology |
Do the quick wins now. Do not wait for the value stream mapping session. Every manual step
you remove this week is one less step cluttering the value stream map and one less source of
friction for the team.
Bring the documented process to the value stream mapping session. The team has already
aligned on what actually happens, removed the obvious waste, and built some momentum. The value
stream mapping session can focus on the systemic issues that require measurement, cross-team
coordination, and deeper analysis.
What Comes Next
- Fix the quick wins. Assign each one to someone with a target of this week or next week.
Do not create a backlog of improvements that sits untouched.
- Schedule the value stream mapping session. Use the documented process as the starting
point. See Value Stream Mapping.
- Start the replacement cycle. For manual validations that are not quick wins, use the
Replacing Manual Validations cycle to systematically
automate and remove them.
Related Content
2.2 - Replacing Manual Validations with Automation
The repeating mechanical cycle at the heart of every brownfield CD migration: identify a manual validation, automate it, prove the automation works, and remove the manual step.
The Brownfield CD overview covers the migration phases, principles, and common challenges.
This page covers the core mechanical process - the specific, repeating cycle of replacing
manual validations with automation that drives every phase forward.
The Replacement Cycle
Every brownfield CD migration follows the same four-step cycle, repeated until no manual
validations remain between commit and production:
- Identify a manual validation in the delivery process.
- Automate the check so it runs in the pipeline without human intervention.
- Validate that the automation catches the same problems the manual step caught.
- Remove the manual step from the process.
Then pick the next manual validation and repeat.
Two rules make this cycle work:
- Do not skip “validate.” Run the manual and automated checks in parallel long enough to
prove the automation catches what the manual step caught. Without this evidence, the team will
not trust the automation, and the manual step will creep back.
- Do not skip “remove.” Keeping both the manual and automated checks adds cost without
removing any. The goal is replacement, not duplication. Once the automated check is proven,
retire the manual step explicitly.
Inventory Your Manual Validations
Before you can replace manual validations, you need to know what they are. A
value stream map is the fastest way to find them. Walk the
path from commit to production and mark every point where a human has to inspect, approve, verify,
or execute something before the change can move forward.
Common manual validations and where they typically live:
| Manual Validation | Where It Lives | What It Catches |
| --- | --- | --- |
| Manual regression testing | QA team runs test cases before release | Functional regressions in existing features |
| Code style review | PR review checklist | Formatting, naming, structural consistency |
| Security review | Security team sign-off before deploy | Vulnerable dependencies, injection risks, auth gaps |
| Environment configuration | Ops team configures target environment | Missing env vars, wrong connection strings, incorrect feature flags |
| Smoke testing | Someone clicks through the app after deploy | Deployment-specific failures, broken integrations |
| Change advisory board | CAB meeting approves production changes | Risk assessment, change coordination, rollback planning |
| Database migration review | DBA reviews and runs migration scripts | Schema conflicts, data loss, performance regressions |
Your inventory will include items not on this list. That is expected. The list above covers the
most common ones, but every team has process-specific manual steps that accumulated over time.
Prioritize by Effort and Friction
Not all manual validations are equal. Some cause significant delay on every release. Others are
quick and infrequent. Prioritize by mapping each validation on two axes:
Friction (vertical axis - how much pain the manual step causes):
- How often does it run? (every commit, every release, quarterly)
- How long does it take? (minutes, hours, days)
- How often does it produce errors? (rarely, sometimes, frequently)
High-frequency, long-duration, error-prone validations cause the most friction.
Effort to automate (horizontal axis - how hard the automation is to build):
- Is the codebase ready? (clean interfaces vs. tightly coupled)
- Do tools exist? (linters, test frameworks, scanning tools)
- Is the validation well-defined? (clear pass/fail vs. subjective judgment)
Start with high-friction, low-effort validations. These give you the fastest return and build
momentum for harder automations later. This is the same constraint-based thinking described in
Identify Constraints - fix the biggest bottleneck first.
|   | Low Effort | High Effort |
| --- | --- | --- |
| High Friction | Start here - fastest return | Plan these - high value but need investment |
| Low Friction | Do these opportunistically | Defer - low return for high cost |
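If it helps to make the prioritization concrete, the sketch below scores a hypothetical inventory and sorts it by friction per unit of automation effort. The validation names and scores are illustrative, not a prescription - substitute your own inventory and estimates.

```python
# Hypothetical sketch: rank manual validations by friction vs. effort to automate.
# Names and scores are illustrative - substitute your own inventory and estimates.

validations = [
    # (name, friction 1-10, effort to automate 1-10)
    ("Manual regression testing",  9, 6),
    ("Code style review",          6, 2),
    ("Smoke testing after deploy", 7, 3),
    ("Change advisory board",      8, 8),
    ("Database migration review",  5, 7),
]

def priority(item):
    _, friction, effort = item
    # Highest friction per unit of automation effort first: fastest return.
    return friction / effort

for name, friction, effort in sorted(validations, key=priority, reverse=True):
    print(f"{name:<30} friction={friction} effort={effort} ratio={friction / effort:.1f}")
```

The output puts high-friction, low-effort validations at the top of the backlog and pushes high-effort items toward the “plan these” quadrant, mirroring the matrix above.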
Walkthrough: Replacing Manual Regression Testing
A concrete example of the full cycle applied to a common brownfield problem.
Starting state
The QA team runs 200 manual test cases before every release. The full regression suite takes three
days. Releases happen every two weeks, so the team spends roughly 20% of every sprint on manual
regression testing.
Step 1: Identify
The value stream map shows the 3-day manual regression cycle as the single largest wait time
between “code complete” and “deployed.” This is the constraint.
Step 2: Automate (start small)
Do not attempt to automate all 200 test cases at once. Rank the test cases by two criteria:
- Failure frequency: Which tests actually catch bugs? (In most suites, a small number of
tests catch the majority of real regressions.)
- Business criticality: Which tests cover the highest-risk functionality?
Pick the top 20 test cases by these criteria. Write automated tests for those 20 first. This is
enough to start the validation step.
Step 3: Validate (parallel run)
Run the 20 automated tests alongside the full manual regression suite for two or three release
cycles. Compare results:
- Did the automated tests catch the same failures the manual tests caught?
- Did the automated tests miss anything the manual tests caught?
- Did the automated tests catch anything the manual tests missed?
Track these results explicitly. They are the evidence the team needs to trust the automation.
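One lightweight way to keep that evidence is to record which defects each approach caught in a given cycle and compare the sets. A minimal sketch, with hypothetical defect IDs:

```python
# Hypothetical sketch: compare what the manual and automated checks caught
# during one parallel-run cycle. Defect IDs are illustrative.

manual_found    = {"BUG-101", "BUG-104", "BUG-107"}
automated_found = {"BUG-101", "BUG-104", "BUG-110"}

caught_by_both        = manual_found & automated_found
missed_by_automation  = manual_found - automated_found   # gaps to close before removing the manual step
extra_from_automation = automated_found - manual_found   # issues the manual suite never saw

print("Caught by both:       ", sorted(caught_by_both))
print("Missed by automation: ", sorted(missed_by_automation))
print("Extra from automation:", sorted(extra_from_automation))
```

An empty “missed by automation” set across several cycles is the signal that the manual step is safe to retire.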
Step 4: Remove
Once the automated tests have proven equivalent for those 20 test cases across multiple cycles,
remove those 20 test cases from the manual regression suite. The manual suite is now 180 test
cases - taking roughly 2.7 days instead of 3.
Repeat
Pick the next 20 highest-value test cases. Automate them. Validate with parallel runs. Remove the
manual cases. The manual suite shrinks with each cycle:
| Cycle | Manual Test Cases | Manual Duration | Automated Tests |
| --- | --- | --- | --- |
| Start | 200 | 3.0 days | 0 |
| 1 | 180 | 2.7 days | 20 |
| 2 | 160 | 2.4 days | 40 |
| 3 | 140 | 2.1 days | 60 |
| 4 | 120 | 1.8 days | 80 |
| 5 | 100 | 1.5 days | 100 |
Each cycle also gets faster because the team builds skill and the test infrastructure matures.
For more on structuring automated tests effectively, see
Testing Fundamentals and
Functional Testing.
When Refactoring Is a Prerequisite
Sometimes you cannot automate a validation because the code is not structured for it. In these
cases, refactoring is a prerequisite step within the replacement cycle - not a separate initiative.
| Code-Level Blocker | Why It Prevents Automation | Refactoring Approach |
| --- | --- | --- |
| Tight coupling between modules | Cannot test one module without setting up the entire system | Extract interfaces at module boundaries so modules can be tested in isolation |
| Hardcoded configuration | Cannot run the same code in test and production environments | Extract configuration into environment variables or config files |
| No clear entry points | Cannot call business logic without going through the UI | Extract business logic into callable functions or services |
| Shared mutable state | Test results depend on execution order and are not repeatable | Isolate state by passing dependencies explicitly instead of using globals |
| Scattered database access | Cannot test logic without a running database and specific data | Consolidate data access behind a repository layer that can be substituted in tests |
The key discipline: refactor only the minimum needed for the specific validation you are
automating. Do not expand the refactoring scope beyond what the current cycle requires. This keeps
the refactoring small, low-risk, and tied to a concrete outcome.
For more on decoupling strategies, see
Architecture Decoupling.
The Compounding Effect
Each completed replacement cycle frees time that was previously spent on manual validation. That
freed time becomes available for the next automation cycle. The pace of migration accelerates as
you progress:
| Cycle | Manual Time per Release | Time Available for Automation | Cumulative Automated Checks |
| --- | --- | --- | --- |
| Start | 5 days | Limited (squeezed between feature work) | 0 |
| After 2 cycles | 4 days | 1 day freed | 2 validations automated |
| After 4 cycles | 3 days | 2 days freed | 4 validations automated |
| After 6 cycles | 2 days | 3 days freed | 6 validations automated |
| After 8 cycles | 1 day | 4 days freed | 8 validations automated |
Early cycles are the hardest because you have the least available time. This is why starting with
the highest-friction, lowest-effort validation matters - it frees the most time for the least
investment.
The same compounding dynamic applies to
small batches - smaller changes are easier to validate, which
makes each cycle faster, which enables even smaller changes.
Small Steps in Everything
The replacement cycle embodies the same small-batch discipline that CD itself requires. The
principle applies at every level of the migration:
- Automate one validation at a time. Do not try to build the entire pipeline in one sprint.
- Refactor one module at a time. Do not launch a “tech debt initiative” to restructure the
whole codebase before you can automate anything.
- Remove one manual check at a time. Do not announce “we are eliminating manual QA” and try
to do it all at once.
The risk of big-step migration:
- The work stalls because the scope is too large to complete alongside feature delivery.
- ROI is distant because nothing is automated until everything is automated.
- Feature delivery suffers because the team is consumed by a transformation project instead of
delivering value.
This connects directly to the brownfield migration principle:
do not stop delivering features. The replacement cycle is designed to produce value at every
iteration, not only at the end.
For more on decomposing work into small steps, see
Work Decomposition.
Measuring Progress
Track these metrics to gauge migration progress. Start collecting them from
baseline before you begin replacing validations.
| Metric | What It Tells You | Target Direction |
| --- | --- | --- |
| Manual validations remaining | How many manual steps still exist between commit and production | Down to zero |
| Time spent on manual validation per release | How much calendar time manual checks consume each release cycle | Decreasing each quarter |
| Pipeline coverage % | What percentage of validations are automated in the pipeline | Increasing toward 100% |
| Deployment frequency | How often you deploy to production | Increasing |
| Lead time for changes | Time from commit to production | Decreasing |
If manual validations remaining is decreasing but deployment frequency is not increasing, you may
be automating low-friction validations that are not on the critical path. Revisit your
prioritization and focus on the validations that are actually blocking faster delivery.
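A minimal sketch of the bookkeeping, tracking the first three metrics release over release (the release names and counts are hypothetical):

```python
# Hypothetical sketch: track replacement-cycle progress across releases.
# Release names and counts are illustrative - use your own inventory.

history = [
    # (release, manual validations remaining, automated checks in the pipeline)
    ("R1", 12, 0),
    ("R2", 10, 2),
    ("R3", 7, 5),
]

for release, manual, automated in history:
    coverage = automated / (manual + automated) * 100
    print(f"{release}: {manual} manual validations remaining, "
          f"pipeline coverage {coverage:.0f}%")
```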
Related Content
3 - Migration Path
A phased approach to adopting continuous delivery, from assessing your current state through full continuous deployment.
The Migration Path is a structured, phased journey from wherever you are today to continuous
deployment. Each phase builds on the previous one, so work through them in order.
The Phases
| Phase | Focus | Key Question |
| --- | --- | --- |
| 0 - Assess | Understand your current state | How far are we from CD? |
| 1 - Foundations | Daily integration, testing, small batches | Can we integrate safely every day? |
| 2 - Pipeline | Automated path from commit to production | Can we deploy any commit automatically? |
| 3 - Optimize | Reduce batch size, limit WIP, measure | Can we deliver small changes quickly? |
| 4 - Deliver on Demand | Deploy any change when the business needs it | Can we deliver any change to production when needed? |
Where to Start
If you are unsure where to begin, start with Phase 0: Assess to understand your
current state and identify the constraints holding you back.
3.1 - Phase 0: Assess
Understand where you are today. Map your delivery process, measure what matters, and identify the constraints holding you back.
Key question: “How far are we from CD?”
Before changing anything, you need to understand your current state. This phase helps you
create a clear picture of your delivery process, establish baseline metrics, and identify
the constraints that will guide your improvement roadmap.
What You’ll Do
- Map your value stream - Visualize the flow from idea to production
- Establish baseline metrics - Measure your current delivery performance
- Identify constraints - Find the bottlenecks limiting your flow
- Complete the current-state checklist - Self-assess against MinimumCD practices
Why This Phase Matters
Teams that skip assessment often invest in the wrong improvements. A team with a 3-week manual
testing cycle doesn’t need better deployment automation first - they need testing fundamentals.
Understanding your constraints ensures you invest effort where it will have the biggest impact.
When You’re Ready to Move On
You’re ready for Phase 1: Foundations when you can answer:
- What does our value stream look like end-to-end?
- What are our current lead time, deployment frequency, and change failure rate?
- What are the top 3 constraints limiting our delivery flow?
- Which MinimumCD practices are we missing?
3.1.1 - Value Stream Mapping
Visualize your delivery process end-to-end to identify waste and constraints before starting your CD migration.
Phase 0 - Assess | Adapted from Dojo Consortium
Before you change anything about how your team delivers software, you need to see how it works
today. Value Stream Mapping (VSM) is the single most effective tool for making your delivery
process visible. It reveals the waiting, the rework, and the handoffs that you have learned to
live with but that are silently destroying your flow.
In the context of a CD migration, a value stream map is not an academic exercise. It is the
foundation for every decision you will make in the phases ahead. It tells you where your time
goes, where quality breaks down, and which constraint to attack first.
What Is a Value Stream Map?
A value stream map is a visual representation of every step required to deliver a change from
request to production. For each step, you capture:
- Process time - the time someone is actively working on that step
- Wait time - the time the work sits idle between steps (in a queue, awaiting approval, blocked on an environment)
- Percent Complete and Accurate (%C/A) - the percentage of work arriving at this step that is usable without rework
The ratio of process time to total time (process time + wait time) is your flow efficiency.
Most teams are shocked to discover that their flow efficiency is below 15%, meaning that for
every hour of actual work, there are nearly six hours of waiting.
Prerequisites
Before running a value stream mapping session, make sure you have:
- An established, repeatable process. You are mapping what actually happens, not what should
happen. If every change follows a different path, start by agreeing on the current “most common”
path.
- All stakeholders in the room. You need representatives from every group involved in delivery:
developers, testers, operations, security, product, change management. Each person knows the
wait times and rework loops in their part of the stream that others cannot see.
- A shared understanding of wait time vs. process time. Wait time is when work sits idle. Process
time is when someone is actively working. A code review that takes “two days” but involves 30
minutes of actual review has 30 minutes of process time and roughly 15.5 working hours of wait time.
Choose Your Mapping Approach
Value stream maps can be built from two directions. Most organizations benefit from starting
bottom-up and then combining into a top-down view, but the right choice depends on where your
delivery pain is concentrated.
Bottom-Up: Map at the Team Level First
Each delivery team maps its own process independently - from the moment a developer is ready to
push a change to the moment that change is running in production. This is the approach described
in Document Your Current Process, elevated to a
formal value stream map with measured process times, wait times, and %C/A.
When to use bottom-up:
- You have multiple teams that each own their own deployment process (or think they do).
- Teams have different pain points and different levels of CD maturity.
- You want each team to own its improvement work rather than waiting for an organizational
initiative.
How it works:
- Each team maps its own value stream using the session format described below.
- Teams identify and fix their own constraints. Many constraints are local - flaky tests,
manual deployment steps, slow code review - and do not require cross-team coordination.
- After teams have mapped and improved their own streams, combine the maps to reveal
cross-team dependencies. Lay the team-level maps side by side and draw the connections:
shared environments, shared libraries, shared approval processes, upstream/downstream
dependencies.
The combined view often reveals constraints that no single team can see: a shared staging
environment that serializes deployments across five teams, a security review team that is
the bottleneck for every release, or a shared library with a release cycle that blocks
downstream teams for weeks.
Advantages: Fast to start, builds team ownership, surfaces team-specific friction that
a high-level map would miss. Teams see results quickly, which builds momentum for the
harder cross-team work.
Top-Down: Map Across Dependent Teams
Start with the full flow from a customer request (or business initiative) entering the system
to the delivered outcome in production, mapping across every team the work touches. This
produces a single map that shows the end-to-end flow including all inter-team handoffs,
shared queues, and organizational boundaries.
When to use top-down:
- Delivery pain is concentrated at the boundaries between teams, not within them.
- A single change routinely touches multiple teams (front-end, back-end, platform,
data, etc.) and the coordination overhead dominates cycle time.
- Leadership needs a full picture of organizational delivery performance to prioritize
investment.
How it works:
- Identify a representative value stream - a type of work that flows through the teams
you want to map. For example: “a customer-facing feature that requires API changes,
a front-end update, and a database migration.”
- Get representatives from every team in the room. Each person maps their team’s portion
of the flow, including the handoff to the next team.
- Connect the segments. The gaps between teams - where work queues, waits for
prioritization, or gets lost in a ticket system - are usually the largest sources of
delay.
Advantages: Reveals organizational constraints that team-level maps cannot see.
Shows the true end-to-end lead time including inter-team wait times. Essential for
changes that require coordinated delivery across multiple teams.
Combining Both Approaches
The most effective strategy for large organizations:
- Start bottom-up. Have each team document its current process
and then run its own value stream mapping session. Fix team-level quick wins immediately.
- Combine into a top-down view. Once team-level maps exist, connect them to see the
full organizational flow. The team-level detail makes the top-down map more accurate
because each segment was mapped by the people who actually do the work.
- Fix constraints at the right level. Team-level constraints (flaky tests, manual
deploys) are fixed by the team. Cross-team constraints (shared environments, approval
bottlenecks, dependency coordination) are fixed at the organizational level.
This layered approach prevents two common failure modes: mapping at too high a level (which
misses team-specific friction) and mapping only at the team level (which misses the
organizational constraints that dominate end-to-end lead time).
How to Run the Session
Step 1: Start From Delivery, Work Backward
Begin at the right side of your map - the moment a change reaches production. Then work backward
through every step until you reach the point where a request enters the system. This prevents teams
from getting bogged down in the early stages and never reaching the deployment process, which is
often where the largest delays hide.
Typical steps you will uncover include:
- Request intake and prioritization
- Story refinement and estimation
- Development (coding)
- Code review
- Build and unit tests
- Integration testing
- Manual QA / regression testing
- Security review
- Staging deployment
- User acceptance testing (UAT)
- Change advisory board (CAB) approval
- Production deployment
- Production verification
Step 2: Capture Process Time and Wait Time for Each Step
For each step on the map, record the process time and the wait time. Use averages if exact numbers
are not available, but prefer real data from your issue tracker, CI system, or deployment logs
when you can get it.
Migration Tip
Pay close attention to these migration-critical delays:
- Handoffs that block flow - Every time work passes from one team or role to another (dev to QA,
QA to ops, ops to security), there is a queue. Count the handoffs. Each one is a candidate for
elimination or automation.
- Manual gates - CAB approvals, manual regression testing, sign-off meetings. These often add
days of wait time for minutes of actual value.
- Environment provisioning delays - If developers wait hours or days for a test environment,
that is a constraint you will need to address in Phase 2.
- Rework loops - Any step where work frequently bounces back to a previous step. Track the
percentage of times this happens. These loops are destroying your cycle time.
Step 3: Calculate %C/A at Each Step
Percent Complete and Accurate measures the quality of the handoff. Ask each person: “What
percentage of the work you receive from the previous step is usable without needing clarification,
correction, or rework?”
A low %C/A at a step means the upstream step is producing defective output. This is critical
information for your migration plan because it tells you where quality needs to be built in
rather than inspected after the fact.
Step 4: Identify Constraints (Kaizen Bursts)
Mark the steps with the largest wait times and the lowest %C/A with a “kaizen burst” - a starburst
symbol indicating an improvement opportunity. These are your constraints. They will become the
focus of your migration roadmap.
Common constraints teams discover during their first value stream map:
| Constraint | Typical Impact | Migration Phase to Address |
| --- | --- | --- |
| Long-lived feature branches | Days of integration delay, merge conflicts | Phase 1 (Trunk-Based Development) |
| Manual regression testing | Days to weeks of wait time | Phase 1 (Testing Fundamentals) |
| Environment provisioning | Hours to days of wait time | Phase 2 (Production-Like Environments) |
| CAB / change approval boards | Days of wait time per deployment | Phase 2 (Pipeline Architecture) |
| Manual deployment process | Hours of process time, high error rate | Phase 2 (Single Path to Production) |
| Large batch releases | Weeks of accumulation, high failure rate | Phase 3 (Small Batches) |
Reading the Results
Once your map is complete, calculate these summary numbers:
- Total lead time = sum of all process times + all wait times
- Total process time = sum of just the process times
- Flow efficiency = total process time / total lead time * 100
- Number of handoffs = count of transitions between different teams or roles
- Rework percentage = percentage of changes that loop back to a previous step
These numbers become part of your baseline metrics and feed directly into
your work to identify constraints.
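If you want to sanity-check the arithmetic, here is a minimal sketch of these summary calculations; the step names, times (in hours), and handoffs are hypothetical.

```python
# Hypothetical sketch: summary numbers from a value stream map.
# Step names, times (in hours), and handoffs are illustrative.

steps = [
    # (step, process_hours, wait_hours, handoff_to_another_team_or_role)
    ("Code review",            0.5, 16.0, True),
    ("Build and unit tests",   0.25, 0.1, False),
    ("Manual regression",     16.0, 24.0, True),
    ("CAB approval",           0.5, 40.0, True),
    ("Production deployment",  1.0,  8.0, True),
]

total_process = sum(process for _, process, _, _ in steps)
total_wait    = sum(wait for _, _, wait, _ in steps)
total_lead    = total_process + total_wait
flow_efficiency = total_process / total_lead * 100
handoffs = sum(1 for _, _, _, handoff in steps if handoff)

print(f"Total lead time:    {total_lead:.1f} hours")
print(f"Total process time: {total_process:.1f} hours")
print(f"Flow efficiency:    {flow_efficiency:.0f}%")
print(f"Handoffs:           {handoffs}")
```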
What Good Looks Like
You are not aiming for a perfect value stream map. You are aiming for a shared, honest picture of
reality that the whole team agrees on. The map should be:
- Visible - posted on a wall or in a shared digital tool where the team sees it daily
- Honest - reflecting what actually happens, including the workarounds and shortcuts
- Actionable - with constraints clearly marked so the team knows where to focus
You will revisit and update this map as you progress through each migration phase. It is a living
document, not a one-time exercise.
Next Step
With your value stream map in hand, proceed to Baseline Metrics to
quantify your current delivery performance.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
3.1.2 - Baseline Metrics
Establish baseline measurements for your current delivery performance before making any changes.
Phase 0 - Assess | Adapted from Dojo Consortium
You cannot improve what you have not measured. Before making any changes to your delivery process,
you need to capture baseline measurements of your current performance. These baselines serve two
purposes: they help you identify where to focus your migration effort, and they give you an
honest “before” picture so you can demonstrate progress as you improve.
This is not about building a sophisticated metrics dashboard. It is about getting four numbers
written down so you have a starting point.
Why Measure Before Changing
Teams that skip baseline measurement fall into predictable traps:
- They cannot prove improvement. Six months into a migration, leadership asks “What has gotten
better?” Without a baseline, the answer is a shrug and a feeling.
- They optimize the wrong thing. Without data, teams default to fixing what is most visible or
most annoying rather than what is the actual constraint.
- They cannot detect regression. A change that feels like an improvement may actually make
things worse in ways that are not immediately obvious.
Baselines do not need to be precise to the minute. A rough but honest measurement is vastly more
useful than no measurement at all.
The Four Essential Metrics
The DORA research program (now part of Google Cloud) identified four key metrics that predict
software delivery performance and organizational outcomes. These are the metrics you should
baseline first.
1. Deployment Frequency
What it measures: How often your team deploys to production.
How to capture it: Count the number of production deployments in the last 30 days. Check your
deployment logs, CI/CD system, or change management records. If deployments are rare enough that
you remember each one, count from memory.
What it tells you:
| Frequency | What It Suggests |
| --- | --- |
| Multiple times per day | You may already be practicing continuous delivery |
| Once per week | You have a regular cadence but likely batch changes |
| Once per month or less | Large batches, high risk per deployment, likely manual process |
| Varies wildly | No consistent process; deployments are event-driven |
Record your number: ______ deployments in the last 30 days.
2. Lead Time for Changes
What it measures: The elapsed time from when code is committed to when it is running in
production.
How to capture it: Pick your last 5-10 production deployments. For each one, find the commit
timestamp of the oldest change included in that deployment and subtract it from the deployment
timestamp. Take the median.
If your team uses feature branches, the clock starts at the first commit on the branch, not when
the branch is merged. This captures the true elapsed time the change spent in the system.
What it tells you:
| Lead Time | What It Suggests |
| --- | --- |
| Less than 1 hour | Fast flow, likely small batches and good automation |
| 1 day to 1 week | Reasonable but with room for improvement |
| 1 week to 1 month | Significant queuing, likely large batches or manual gates |
| More than 1 month | Major constraints in testing, approval, or deployment |
Record your number: ______ median lead time for changes.
3. Change Failure Rate
What it measures: The percentage of deployments to production that result in a degraded
service requiring remediation (rollback, hotfix, patch, or incident).
How to capture it: Look at your last 20-30 production deployments. Count how many caused an
incident, required a rollback, or needed an immediate hotfix. Divide by the total number of
deployments.
What it tells you:
| Failure Rate | What It Suggests |
| --- | --- |
| 0-5% | Strong quality practices and small change sets |
| 5-15% | Typical for teams with some automation |
| 15-30% | Quality gaps, likely insufficient testing or large batches |
| Above 30% | Systemic quality problems; changes are frequently broken |
Record your number: ______ % of deployments that required remediation.
4. Mean Time to Restore (MTTR)
What it measures: How long it takes to restore service after a production failure caused by a
deployment.
How to capture it: Look at your production incidents from the last 3-6 months. For each
incident caused by a deployment, note the time from detection to resolution. Take the median.
If you have not had any deployment-caused incidents, note that - it either means your quality
is excellent or your deployment frequency is so low that you have insufficient data.
What it tells you:
| MTTR | What It Suggests |
| --- | --- |
| Less than 1 hour | Good incident response, likely automated rollback |
| 1-4 hours | Manual but practiced recovery process |
| 4-24 hours | Significant manual intervention required |
| More than 1 day | Serious gaps in observability or rollback capability |
Record your number: ______ median time to restore service.
Capturing Your Baselines
You do not need specialized tooling to capture these four numbers. Here is a practical approach:
- Check your CI/CD system. Most CI/CD tools (Jenkins, GitHub Actions, GitLab CI, Azure
DevOps) have deployment history. Export the last 30-90 days of deployment records.
- Check your incident tracker. Pull incidents from the last 3-6 months and filter for
deployment-caused issues.
- Check your version control. Git log data combined with deployment timestamps gives you
lead time.
- Ask the team. If data is scarce, have a conversation with the team. Experienced team
members can provide reasonable estimates for all four metrics.
Record these numbers somewhere the whole team can see them. A wiki page, a whiteboard, a shared
document - the format does not matter. What matters is that they are written down and dated.
What About Automation?
If you already have a CI/CD system that tracks deployments, you can extract most of these numbers
programmatically. But do not let the pursuit of automation delay your baseline. A spreadsheet
with manually gathered numbers is perfectly adequate for Phase 0. You will build more
sophisticated measurement into your pipeline in Phase 2.
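If you do go the programmatic route later, the extraction does not need to be elaborate. A minimal sketch, assuming you have exported each deployment with its timestamp, the oldest commit it contained, and whether it needed remediation (the field names and data are hypothetical):

```python
# Hypothetical sketch: baseline numbers from exported deployment records.
# Field names and data are illustrative - adapt to what your CI/CD tool exports.
from datetime import datetime
from statistics import median

deployments = [
    # oldest commit included, deployment time, needed rollback/hotfix?
    {"oldest_commit": datetime(2024, 5, 1, 9, 0),   "deployed": datetime(2024, 5, 7, 14, 0),  "failed": False},
    {"oldest_commit": datetime(2024, 5, 6, 11, 0),  "deployed": datetime(2024, 5, 14, 15, 0), "failed": True},
    {"oldest_commit": datetime(2024, 5, 13, 10, 0), "deployed": datetime(2024, 5, 21, 13, 0), "failed": False},
]

lead_times = [d["deployed"] - d["oldest_commit"] for d in deployments]
failure_rate = sum(d["failed"] for d in deployments) / len(deployments)

print("Deployments in window:", len(deployments))
print("Median lead time:     ", median(lead_times))
print("Change failure rate:  ", f"{failure_rate:.0%}")
```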
What Your Baselines Tell You About Where to Focus
Your baseline metrics point toward specific constraints:
| Signal | Likely Constraint | Where to Look |
| --- | --- | --- |
| Low deployment frequency + high lead time | Large batches, manual process | Value Stream Map for queue times |
| High change failure rate | Insufficient testing, poor quality practices | Testing Fundamentals |
| High MTTR | No rollback capability, poor observability | Rollback |
| High lead time + low change failure rate | Excessive manual gates adding delay but not value | Identify Constraints |
Use these signals alongside your value stream map to identify your top constraints.
A Warning About Metrics
Goodhart's Law
“When a measure becomes a target, it ceases to be a good measure.”
These metrics are diagnostic tools, not performance targets. The moment you use them to compare
teams, rank individuals, or set mandated targets, people will optimize for the metric rather
than for actual delivery improvement. A team can trivially improve their deployment frequency
number by deploying empty changes, or reduce their change failure rate by never deploying anything
risky.
Use these metrics within the team, for the team. Share trends with leadership if needed, but
never publish team-level metrics as a leaderboard. The goal is to help each team understand
their own delivery health, not to create competition.
Next Step
With your baselines recorded, proceed to Identify Constraints to
determine which bottleneck to address first.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
3.1.3 - Identify Constraints
Use your value stream map and baseline metrics to find the bottlenecks that limit your delivery flow.
Your value stream map shows you where time goes. Your
baseline metrics tell you how fast and how safely you deliver. Now you
need to answer the most important question in your migration: What is the one thing most
limiting your delivery flow right now?
This is not a question you answer by committee vote or gut feeling. It is a question you answer
with the data you have already collected.
The Theory of Constraints
Eliyahu Goldratt’s Theory of Constraints offers a simple and powerful insight: every system has
exactly one constraint that limits its overall throughput. Improving anything other than that
constraint does not improve the system.
Consider a delivery process where code review takes 30 minutes but the queue to get a review
takes 2 days, and manual regression testing takes 5 days after that. If you invest three months
building a faster build pipeline that saves 10 minutes per build, you have improved something
that is not the constraint. The 5-day regression testing cycle still dominates your lead time.
You have made a non-bottleneck more efficient, which changes nothing about how fast you deliver.
The implication for your CD migration is direct: you must find and address constraints in order
of impact. Fix the biggest one first. Then find the next one. Then fix that. This is how you
make sustained, measurable progress rather than spreading effort across improvements that do not
move the needle.
Common Constraint Categories
Software delivery constraints tend to cluster into a few recurring categories. As you review your
value stream map, look for these patterns.
Testing Bottlenecks
Symptoms: Large wait time between “code complete” and “verified.” Manual regression test
cycles measured in days or weeks. Low %C/A at the testing step, indicating frequent rework.
High change failure rate in your baseline metrics despite significant testing effort.
What is happening: Testing is being done as a phase after development rather than as a
continuous activity during development. Manual test suites have grown to cover every scenario
ever encountered, and running them takes longer with every release. The test environment is
shared and frequently broken.
Migration path: Phase 1 - Testing Fundamentals
Deployment Gates
Symptoms: Wait times of days or weeks between “tested” and “deployed.” Change Advisory Board
(CAB) meetings that happen weekly or biweekly. Multiple sign-offs required from people who are
not involved in the actual change.
What is happening: The organization has substituted process for confidence. Because
deployments have historically been risky (large batches, manual processes, poor rollback), layers
of approval have been added. These approvals add delay but rarely catch issues that automated
testing would not. They exist because the deployment process is not trustworthy, and they
persist because removing them feels dangerous.
Migration path: Phase 2 - Pipeline Architecture and
building the automated quality evidence that makes manual approvals unnecessary.
Environment Provisioning
Symptoms: Developers waiting hours or days for a test or staging environment. “Works on my
machine” failures when code reaches a shared environment. Environments that drift from production
configuration over time.
What is happening: Environments are manually provisioned, shared across teams, and treated as
pets rather than cattle. There is no automated way to create a production-like environment on
demand. Teams queue for shared environments, and environment configuration has diverged from
production.
Migration path: Phase 2 - Production-Like Environments
Code Review Delays
Symptoms: Pull requests sitting open for more than a day. Review queues with 5 or more
pending reviews. Developers context-switching because they are blocked waiting for review.
What is happening: Code review is being treated as an asynchronous handoff rather than a
collaborative activity. Reviews happen when the reviewer “gets to it” rather than as a
near-immediate response. Large pull requests make review daunting, which increases queue time
further.
Migration path: Phase 1 - Code Review and
Trunk-Based Development to reduce branch lifetime
and review size.
Manual Handoffs
Symptoms: Multiple steps in your value stream map where work transitions from one team to
another. Tickets being reassigned across teams. “Throwing it over the wall” language in how people
describe the process.
What is happening: Delivery is organized as a sequence of specialist stages (dev, test, ops,
security) rather than as a cross-functional flow. Each handoff introduces a queue, a context
loss, and a communication overhead. The more handoffs, the longer the lead time and the more
likely that information is lost.
Migration path: This is an organizational constraint, not a technical one. It is addressed
gradually through cross-functional team formation and by automating the specialist activities
into the pipeline so that handoffs become automated checks rather than manual transfers.
Using Your Value Stream Map to Find the Constraint
Pull out your value stream map and follow this process:
Step 1: Rank Steps by Wait Time
List every step in your value stream and sort them by wait time, longest first. Your biggest
constraint is almost certainly in the top three. Wait time is more important than process time
because wait time is pure waste - nothing is happening, no value is being created.
Step 2: Look for Rework Loops
Identify steps where work frequently loops back. A testing step with a 40% rework rate means
that nearly half of all changes go through the development-to-test cycle twice. The effective
wait time for that step is nearly doubled when you account for rework.
Step 3: Count Handoffs
Each handoff between teams or roles is a queue point. If your value stream has 8 handoffs, you
have 8 places where work waits. Look for handoffs that could be eliminated by automation or
by reorganizing work within the team.
Step 4: Cross-Reference with Metrics
Check your findings against your baseline metrics:
- High lead time with low process time = the constraint is in the queues (wait time), not in
the work itself
- High change failure rate = the constraint is in quality practices, not in speed
- Low deployment frequency with everything else reasonable = the constraint is in the
deployment process itself or in organizational policy
Prioritizing: Fix the Biggest One First
One Constraint at a Time
Resist the temptation to tackle multiple constraints simultaneously. The Theory of Constraints
is clear: improving a non-bottleneck does not improve the system. Identify the single biggest
constraint, focus your migration effort there, and only move to the next constraint when the
first one is no longer the bottleneck.
This does not mean the entire team works on one thing. It means your improvement initiatives
are sequenced to address constraints in order of impact.
Once you have identified your top constraint, map it to a migration phase using the constraint categories above - each one names the phase that addresses it.
The Next Constraint
Fixing your first constraint will improve your flow. It will also reveal the next constraint.
This is expected and healthy. A delivery process is a chain, and strengthening the weakest link
means a different link becomes the weakest.
This is why the migration is organized in phases. Phase 1 addresses the foundational constraints
that nearly every team has (integration practices, testing, small work). Phase 2 addresses
pipeline constraints. Phase 3 optimizes flow. You will cycle through constraint identification
and resolution throughout your migration.
Plan to revisit your value stream map and metrics after addressing each major constraint. Your
map from today will be outdated within weeks of starting your migration - and that is a sign of
progress.
Next Step
Complete the Current State Checklist to assess your team against
specific MinimumCD practices and confirm your migration starting point.
3.1.4 - Current State Checklist
Self-assess your team against MinimumCD practices to understand your starting point and determine where to begin your migration.
This checklist translates the practices defined by MinimumCD.org into
concrete yes-or-no questions you can answer about your team today. It is not a test to pass. It is
a diagnostic tool that shows you which practices are already in place and which ones your migration
needs to establish.
Work through each category with your team. Be honest - checking a box you have not earned gives
you a migration plan that skips steps you actually need.
How to Use This Checklist
For each item, mark it with an [x] if your team consistently does this today - not occasionally,
not aspirationally, but as a default practice. If you do it sometimes but not reliably, leave it
unchecked.
Trunk-Based Development
Why it matters: Long-lived branches are the single biggest source of integration risk. Every
hour a branch lives is an hour where it diverges from what everyone else is doing. Trunk-based
development eliminates integration as a separate, painful event and makes it a continuous,
trivial activity. Without this practice, continuous integration is impossible, and without
continuous integration, continuous delivery is impossible.
Continuous Integration
Why it matters: Continuous integration means that the team always knows whether the codebase
is in a working state. If builds are not automated, if tests do not run on every commit, or if
broken builds are tolerated, then the team is flying blind. Every change is a gamble that
something else has not broken in the meantime.
Pipeline Practices
Why it matters: A pipeline is the mechanism that turns code changes into production
deployments. If the pipeline is inconsistent, manual, or bypassable, then you do not have a
reliable path to production. You have a collection of scripts and hopes. Deterministic, automated
pipelines are what make deployment a non-event rather than a high-risk ceremony.
Deployment
Why it matters: If your test environment does not look like production, your tests are lying
to you. If configuration is baked into your artifact, you are rebuilding for each environment,
which means the thing you tested is not the thing you deploy. If you cannot roll back quickly,
every deployment is a high-stakes bet. These practices ensure that what you test is what you
ship, and that shipping is safe.
Quality
Why it matters: Quality that depends on manual inspection does not scale and does not speed
up. As your deployment frequency increases through the migration, manual quality gates become
the bottleneck. The goal is to build quality in through automation so that a green build means
a deployable build. This is the foundation of continuous delivery: if it passes the pipeline,
it is ready for production.
Scoring Guide
Count the number of items you checked across all categories.
| Score | Your Starting Point | Recommended Phase |
| --- | --- | --- |
| 0-5 | You are early in your journey. Most foundational practices are not yet in place. | Start at the beginning of Phase 1 - Foundations. Focus on trunk-based development and basic test automation first. |
| 6-12 | You have some practices in place but significant gaps remain. This is the most common starting point. | Start with Phase 1 - Foundations but focus on the categories where you had the fewest checks. Your constraint analysis will tell you which gap to close first. |
| 13-18 | Your foundations are solid. The gaps are likely in pipeline automation and deployment practices. | You may be able to move quickly through Phase 1 and focus your effort on Phase 2 - Pipeline. Validate with your value stream map that your remaining constraints match. |
| 19-22 | You are well-practiced in most areas. Your migration is about closing specific gaps and optimizing flow. | Review your unchecked items - they point to specific topics in Phase 3 - Optimize or Phase 4 - Deliver on Demand. |
| 23-25 | You are already practicing most of what MinimumCD defines. Your focus should be on consistency and delivering on demand. | Jump to Phase 4 - Deliver on Demand and focus on the capability to deploy any change when needed. |
A Score Is Not a Grade
This checklist exists to help your team find its starting point, not to judge your team’s
competence. A score of 5 does not mean your team is failing - it means your team has a clear
picture of what to work on. A score of 22 does not mean you are done - it means your remaining
gaps are specific and targeted.
The only wrong answer is a dishonest one.
Putting It All Together
You now have four pieces of information from Phase 0:
- A value stream map showing your end-to-end delivery process with wait times and rework loops
- Baseline metrics for deployment frequency, lead time, change failure rate, and MTTR
- An identified top constraint telling you where to focus first
- This checklist confirming which practices are in place and which are missing
Together, these give you a clear, data-informed starting point for your migration. You know where
you are, you know what is slowing you down, and you know which practices to establish first.
Next Step
You are ready to begin Phase 1 - Foundations. Start with the practice area
that addresses your top constraint.
3.2 - Phase 1: Foundations
Establish the essential practices for daily integration, testing, and small work decomposition.
Key question: “Can we integrate safely every day?”
This phase establishes the development practices that make continuous delivery possible.
Without these foundations, pipeline automation just speeds up a broken process.
What You’ll Do
- Adopt trunk-based development - Integrate to trunk at least daily
- Build testing fundamentals - Create a fast, reliable test suite
- Automate your build - One command to build, test, and package
- Decompose work - Break features into small, deliverable increments
- Streamline code review - Fast, effective review that doesn’t block flow
- Establish working agreements - Shared definitions of done and ready
- Everything as code - Infrastructure, pipelines, schemas, monitoring, and security policies in version control, delivered through pipelines
Why This Phase Matters
These practices are the prerequisites for everything that follows. Trunk-based development
eliminates merge hell. Testing fundamentals give you the confidence to deploy frequently.
Small work decomposition reduces risk per change. Together, they create the feedback loops
that drive continuous improvement.
When You’re Ready to Move On
You’re ready for Phase 2: Pipeline when:
- All developers integrate to trunk at least once per day
- Your test suite catches real defects and runs in under 10 minutes
- You can build and package your application with a single command
- Most work items are completable within 2 days
3.2.1 - Trunk-Based Development
Integrate all work to the trunk at least once per day to enable continuous integration.
Phase 1 - Foundations | Adapted from MinimumCD.org
Trunk-based development is the first foundation to establish. Without daily integration to a shared trunk, the rest of the CD migration cannot succeed. This page covers the core practice, two migration paths, and a tactical guide for getting started.
What Is Trunk-Based Development?
Trunk-based development (TBD) is a branching strategy where all developers integrate their work into a single shared branch - the trunk - at least once per day. The trunk is always kept in a releasable state.
This is a non-negotiable prerequisite for continuous delivery. If your team is not integrating to trunk daily, you are not doing CI, and you cannot do CD. There is no workaround.
“If it hurts, do it more often, and bring the pain forward.”
- Jez Humble, Continuous Delivery
What TBD Is Not
- It is not “everyone commits directly to main with no guardrails.” You still test, review, and validate work - you just do it in small increments.
- It is not incompatible with code review. It requires review to happen quickly.
- It is not reckless. It is the opposite: small, frequent integrations are far safer than large, infrequent merges.
What Trunk-Based Development Improves
| Problem | How TBD Helps |
|---|---|
| Merge conflicts | Small changes integrated frequently rarely conflict |
| Integration risk | Bugs are caught within hours, not weeks |
| Long-lived branches diverge from reality | The trunk always reflects the current state of the codebase |
| “Works on my branch” syndrome | Everyone shares the same integration point |
| Slow feedback | CI runs on every integration, giving immediate signal |
| Large batch deployments | Small changes are individually deployable |
| Fear of deployment | Each change is small enough to reason about |
Two Migration Paths
There are two valid approaches to trunk-based development. Both satisfy the minimum CD requirement of daily integration. Choose the one that fits your team’s current maturity and constraints.
Path 1: Short-Lived Branches
Developers create branches that live for less than 24 hours. Work is done on the branch, reviewed quickly, and merged to trunk within a single day.
How it works:
- Pull the latest trunk
- Create a short-lived branch
- Make small, focused changes
- Open a pull request (or use pair programming as the review)
- Merge to trunk before end of day
- The branch is deleted after merge
Best for teams that:
- Currently use long-lived feature branches and need a stepping stone
- Have regulatory requirements for traceable review records
- Use pull request workflows they want to keep (but make faster)
- Are new to TBD and want a gradual transition
Key constraint: The branch must merge to trunk within 24 hours. If it does not, you have a long-lived branch and you have lost the benefit of TBD.
Path 2: Direct Trunk Commits
Developers commit directly to trunk. Quality is ensured through pre-commit checks, pair programming, and strong automated testing.
How it works:
- Pull the latest trunk
- Make a small, tested change locally
- Run the local build and test suite
- Push directly to trunk
- CI validates the commit immediately
Best for teams that:
- Have strong automated test coverage
- Practice pair or mob programming (which provides real-time review)
- Want maximum integration frequency
- Have high trust and shared code ownership
Key constraint: This requires excellent test coverage and a culture where the team owns quality collectively. Without these, direct trunk commits become reckless.
How to Choose Your Path
Ask these questions:
- Do you have automated tests that catch real defects? If no, start with Path 1 and invest in testing fundamentals in parallel.
- Does your organization require documented review approvals? If yes, use Path 1 with rapid pull requests.
- Does your team practice pair programming? If yes, Path 2 may work immediately - pairing is a continuous review process.
- How large is your team? Teams of 2-4 can adopt Path 2 more easily. Larger teams may start with Path 1 and transition later.
Both paths are valid. The important thing is daily integration to trunk. Do not spend weeks debating which path to use. Pick one, start today, and adjust.
Essential Supporting Practices
Trunk-based development does not work in isolation. These supporting practices make daily integration safe and sustainable.
Feature Flags
When you integrate to trunk daily, incomplete features will exist on trunk. Feature flags let you merge code that is not yet ready for users.
Rules for feature flags in TBD:
- Use flags to decouple deployment from release
- Remove flags within days or weeks - they are temporary by design
- Keep flag logic simple; avoid nested or dependent flags
- Test both flag states in your automated test suite
Feature flags are covered in more depth in Phase 3: Optimize.
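To make the rule about testing both flag states concrete, here is a minimal sketch in Python (pytest). The in-memory FLAGS dictionary, the `is_enabled` helper, and the flag name are hypothetical stand-ins for whatever flag provider you use; the point is that a single parametrized test exercises both states of the flag.

```python
import pytest

FLAGS = {"new_checkout": False}  # in practice, read from your flag provider


def is_enabled(name: str) -> bool:
    return FLAGS.get(name, False)


def checkout_total(items, discount_service=None):
    total = sum(item["price"] for item in items)
    if is_enabled("new_checkout") and discount_service:
        total -= discount_service.discount_for(total)
    return total


@pytest.mark.parametrize("flag_state", [True, False])
def test_checkout_total_under_both_flag_states(flag_state, monkeypatch):
    # Exercise both flag states so incomplete work on trunk stays safe.
    monkeypatch.setitem(FLAGS, "new_checkout", flag_state)

    class StubDiscounts:
        def discount_for(self, total):
            return 5

    total = checkout_total([{"price": 50}, {"price": 50}], StubDiscounts())
    assert total == (95 if flag_state else 100)
```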
Commit Small, Commit Often
Each commit should be a small, coherent change that leaves trunk in a working state. If you are committing once a day in a large batch, you are not getting the benefit of TBD.
Guidelines:
- Each commit should be independently deployable
- A commit should represent a single logical change
- If you cannot describe the change in one sentence, it is too big
- Target multiple commits per day, not one large commit at end of day
Test-Driven Development (TDD) and ATDD
TDD provides the safety net that makes frequent integration sustainable. When every change is accompanied by tests, you can integrate confidently.
- TDD: Write the test before the code. Red, green, refactor.
- ATDD (Acceptance Test-Driven Development): Write acceptance criteria as executable tests before implementation.
Both practices ensure that your test suite grows with your code and that trunk remains releasable.
Getting Started: A Tactical Guide
Step 1: Shorten Your Branches (Week 1)
If your team currently uses long-lived feature branches, start by shortening their lifespan.
| Current State | Target |
|---|---|
| Branches live for weeks | Branches live for < 1 week |
| Merge once per sprint | Merge multiple times per week |
| Large merge conflicts are normal | Conflicts are rare and small |
Action: Set a team agreement that no branch lives longer than 2 days. Track branch age as a metric.
Step 2: Integrate Daily (Week 2-3)
Tighten the window from 2 days to 1 day.
Action:
- Every developer merges to trunk at least once per day, every day they write code
- If work is not complete, use a feature flag or other technique to merge safely
- Track integration frequency as your primary metric
Step 3: Ensure Trunk Stays Green (Week 2-3)
Daily integration is only useful if trunk remains in a releasable state.
Action:
- Run your test suite on every merge to trunk
- If the build breaks, fixing it becomes the team’s top priority
- Establish a working agreement: “broken build = stop the line” (see Working Agreements)
Step 4: Remove the Safety Net of Long Branches (Week 4+)
Once the team is integrating daily with a green trunk, eliminate the option of long-lived branches.
Action:
- Configure branch protection rules to warn or block branches older than 24 hours
- Remove any workflow that depends on long-lived branches (e.g., “dev” or “release” branches)
- Celebrate the transition - this is a significant shift in how the team works
Key Pitfalls
1. “We integrate daily, but we also keep our feature branches”
If you are merging to trunk daily but also maintaining a long-lived feature branch, you are not doing TBD. The feature branch will diverge, and merging it later will be painful. The integration to trunk must be the only integration point.
2. “Our builds are too slow for frequent integration”
If your CI pipeline takes 30 minutes, integrating multiple times a day feels impractical. This is a real constraint - address it by investing in build automation and parallelizing your test suite. Target a build time under 10 minutes.
3. “We can’t integrate incomplete features to trunk”
Yes, you can. Use feature flags to hide incomplete work from users. The code exists on trunk, but the feature is not active. This is a standard practice at every company that practices CD.
4. “Code review takes too long for daily integration”
If pull request reviews take 2 days, daily integration is impossible. The solution is to change how you review: pair programming provides continuous review, mob programming reviews in real time, and small changes can be reviewed asynchronously in minutes. See Code Review for specific techniques.
5. “What if someone pushes a bad commit to trunk?”
This is why you have automated tests, CI, and the “broken build = top priority” agreement. Bad commits will happen. The question is how fast you detect and fix them. With TBD and CI, the answer is minutes, not days.
Measuring Success
Track these metrics to verify your TBD adoption:
| Metric | Target | Why It Matters |
|---|---|---|
| Integration frequency | At least 1 per developer per day | Confirms daily integration is happening |
| Branch age | < 24 hours | Catches long-lived branches |
| Build duration | < 10 minutes | Enables frequent integration without frustration |
| Merge conflict frequency | Decreasing over time | Confirms small changes reduce conflicts |
Further Reading
This page covers the essentials for Phase 1 of your migration. For detailed guidance on specific scenarios, see the full source material:
Next Step
Once your team is integrating to trunk daily, build the test suite that makes that integration trustworthy. Continue to Testing Fundamentals.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
3.2.2 - Testing Fundamentals
Build a test architecture that gives your pipeline the confidence to deploy any change, even when dependencies outside your control are unavailable.
Phase 1 - Foundations | Adapted from Dojo Consortium
Before you can trust your pipeline, you need a test suite that is fast, deterministic, and catches
real defects. But a collection of tests is not enough. You need a test architecture - a
deliberate structure where different types of tests work together to give you the confidence to
deploy every change, regardless of whether external systems are up, slow, or behaving
unexpectedly.
Why Testing Is a Foundation
Continuous delivery requires that trunk always be releasable. The only way to know trunk is
releasable is to test it - automatically, on every change. Without a reliable test suite, daily
integration is just daily risk.
In many organizations, testing is the single biggest obstacle to CD adoption. Not because teams
lack tests, but because the tests they have are slow, flaky, poorly structured, and - most
critically - unable to give the pipeline a reliable answer to the question: is this change safe
to deploy?
Testing Goals for CD
Your test suite must meet these criteria before it can support continuous delivery:
| Goal | Target | Why |
|---|---|---|
| Fast | Full suite completes in under 10 minutes | Developers need feedback before context-switching |
| Deterministic | Same code always produces the same test result | Flaky tests destroy trust and get ignored |
| Catches real bugs | Tests fail when behavior is wrong, not when implementation changes | Brittle tests create noise, not signal |
| Independent of external systems | Pipeline can determine deployability without any dependency being available | Your ability to deploy cannot be held hostage by someone else’s outage |
If your test suite does not meet these criteria today, improving it is your highest-priority
foundation work.
Beyond the Test Pyramid
The test pyramid - many unit tests at the base, fewer integration tests in the middle, a handful
of end-to-end tests at the top - has been the dominant mental model for test strategy since Mike
Cohn introduced it. The core insight is sound: push testing as low as possible. Lower-level
tests are faster, more deterministic, and cheaper to maintain. Higher-level tests are slower,
more brittle, and more expensive.
But as a prescriptive model, the pyramid is overly simplistic. Teams that treat it as a rigid
ratio end up in unproductive debates about whether they have “too many” integration tests or “not
enough” unit tests. The shape of your test distribution matters far less than whether your tests,
taken together, give you the confidence to deploy.
What actually matters
The pyramid’s principle - write tests with different granularity - remains correct. But for
CD, the question is not “do we have the right pyramid shape?” The question is:
Can our pipeline determine that a change is safe to deploy without depending on any system we
do not control?
This reframes the testing conversation. Instead of counting tests by type and trying to match a
diagram, you design a test architecture where:
- Fast, deterministic tests catch the vast majority of defects and run on every commit. These tests use test doubles for anything outside the team’s control. They give you a reliable go/no-go signal in minutes.
- Contract tests verify that your test doubles still match reality. They run asynchronously and catch drift between your assumptions and the real world - without blocking your pipeline.
- A small number of non-deterministic tests validate that the fully integrated system works. These run post-deployment and provide monitoring, not gating.
This structure means your pipeline can confidently say “yes, deploy this” even if a downstream
API is having an outage, a third-party service is slow, or a partner team hasn’t deployed their
latest changes yet. Your ability to deliver is decoupled from the reliability of systems you do
not own.
The anti-pattern: the ice cream cone
Most teams that struggle with CD have an inverted test distribution - too many slow, expensive
end-to-end tests and too few fast, focused tests.
┌─────────────────────────┐
│ Manual Testing │ ← Most testing happens here
├─────────────────────────┤
│ End-to-End Tests │ ← Slow, flaky, expensive
├─────────────────────────┤
│ Integration Tests │ ← Some, but not enough
├───────────┤
│Unit Tests │ ← Too few
└───────────┘
The ice cream cone makes CD impossible. Manual testing gates block every release. End-to-end tests
take hours, fail randomly, and depend on external systems being healthy. The pipeline cannot give
a fast, reliable answer about deployability, so deployments become high-ceremony events.
Test Architecture for the CD Pipeline
A test architecture is the deliberate structure of how different test types work together across
your pipeline to give you deployment confidence. Each layer has a specific role, and the layers
reinforce each other.
Layer 1: Unit tests - verify logic in isolation
Unit tests exercise individual functions, methods, or components with all external dependencies
replaced by test doubles. They are the fastest and most
deterministic tests you have.
Role in CD: Catch logic errors, regressions, and edge cases instantly. Provide the tightest
feedback loop - developers should see results in seconds while coding.
What they cannot do: Verify that components work together, that your code correctly calls
external services, or that the system behaves correctly as a whole.
See Unit Tests for detailed guidance.
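For illustration, a minimal unit-test sketch in Python with pytest; the pricing function is a hypothetical example, not part of any referenced codebase.

```python
import pytest


def apply_discount(price: float, percent: float) -> float:
    # Pure logic with no external dependencies - ideal unit-test territory.
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


def test_apply_discount_reduces_price():
    assert apply_discount(200.0, 25) == 150.0


def test_apply_discount_rejects_invalid_percent():
    with pytest.raises(ValueError):
        apply_discount(100.0, 150)
```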
Layer 2: Integration tests - verify boundaries
Integration tests verify that components interact correctly at their boundaries: database queries
return the expected data, HTTP clients serialize requests correctly, message producers format
messages as expected. External systems are replaced with test doubles, but internal collaborators
are real.
Role in CD: Catch the bugs that unit tests miss - mismatched interfaces, serialization errors,
query bugs. These tests are fast enough to run on every commit but realistic enough to catch
real integration failures.
What they cannot do: Verify that the system works end-to-end from a user’s perspective, or
that your assumptions about external services are still correct.
The line between unit tests and integration tests is often debated. As Ham Vocke notes in
The Practical Test Pyramid, the naming matters less than the discipline. The key question is
whether the test is fast, deterministic, and tests something your unit tests cannot. If yes, it
belongs here.
See Integration Tests for detailed guidance.
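A sketch of what a boundary-level test can look like in Python, assuming a hypothetical repository class; the query logic is real, but the database is an in-memory SQLite instance, so no external server is needed and the test stays deterministic.

```python
import sqlite3


class UserRepository:
    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn

    def add(self, email: str) -> int:
        cur = self.conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        return cur.lastrowid

    def find_by_email(self, email: str):
        return self.conn.execute(
            "SELECT id, email FROM users WHERE email = ?", (email,)
        ).fetchone()


def test_user_repository_round_trip():
    # Real SQL, real driver, but an in-memory database - fast and repeatable.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    repo = UserRepository(conn)

    user_id = repo.add("dev@example.com")
    found = repo.find_by_email("dev@example.com")

    assert found == (user_id, "dev@example.com")
```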
Layer 3: Functional tests - verify your system works in isolation
Functional tests (also called component tests) exercise your entire sub-system - your service,
your application - from the outside, as a user or consumer would interact with it. All external
dependencies are replaced with test doubles. The test boots your application, sends real HTTP
requests or simulates real user interactions, and verifies the responses.
Role in CD: This is the layer that proves your system works as a complete unit, independent
of everything else. Functional tests answer: “if we deploy this service right now, will it
behave correctly for every interaction that is within our control?” Because all external
dependencies are stubbed, these tests are deterministic and fast. They can run on every commit.
Why this layer is critical for CD: Functional tests are what allow you to deploy with
confidence even when dependencies outside your control are unavailable. Your test doubles
simulate the expected behavior of those dependencies. As long as your doubles are accurate (which
is what contract tests verify), your functional tests prove your system handles those interactions
correctly.
See Functional Tests for detailed guidance.
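As an illustration, here is a functional-test sketch that assumes a small Flask service; the payment gateway client is a hypothetical external dependency replaced with a stub, and the test drives the whole application through its HTTP interface.

```python
from flask import Flask, jsonify


class StubPaymentGateway:
    """Test double for an external dependency outside the team's control."""

    def charge(self, amount_cents: int) -> dict:
        return {"status": "approved", "amount": amount_cents}


def create_app(gateway) -> Flask:
    app = Flask(__name__)

    @app.post("/orders")
    def create_order():
        result = gateway.charge(4999)
        return jsonify(result), 201

    return app


def test_create_order_returns_approved_charge():
    app = create_app(StubPaymentGateway())
    client = app.test_client()  # exercises routing, serialization, and handlers

    response = client.post("/orders")

    assert response.status_code == 201
    assert response.get_json()["status"] == "approved"
```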
Layer 4: Contract tests - verify your assumptions about others
Contract tests validate that the test doubles you use in layers 1-3 still accurately represent
the real external systems. They run against live dependencies and check contract format - response
structures, field names, types, and status codes - not specific data values.
Role in CD: Contract tests are the bridge between your fast, deterministic test suite and the
real world. Without them, your test doubles can silently drift from reality, and your functional
tests provide false confidence. With them, you know that the assumptions baked into your test
doubles are still correct.
Consumer-driven contracts take this further: the consumer of an API publishes expectations
(using tools like Pact), and the provider runs those expectations as part of
their build. Both teams know immediately when a change would break the contract.
Contract tests are non-deterministic because they hit live systems. They should not block
your pipeline. Instead, failures trigger a review: has the contract changed, or was it a transient
network issue? If the contract has changed, update your test doubles and re-verify.
See Contract Tests for detailed guidance.
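A minimal contract-test sketch, assuming a hypothetical inventory API that your test doubles simulate. Note that it checks structure and types, not specific values, and is meant to run on a schedule rather than as a pipeline gate.

```python
import requests

# Placeholder URL for a dependency your functional-test doubles simulate.
INVENTORY_API = "https://inventory.example.com/api/items/123"


def test_inventory_item_contract():
    response = requests.get(INVENTORY_API, timeout=10)
    assert response.status_code == 200

    body = response.json()
    # Verify the contract format - field names and types - not the data itself.
    assert isinstance(body["sku"], str)
    assert isinstance(body["quantity"], int)
    assert body["status"] in {"in_stock", "backordered", "discontinued"}
```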
Layer 5: End-to-end tests - verify the integrated system post-deployment
End-to-end tests validate complete user journeys through the fully integrated system with no
test doubles. They run against real services, real databases, and real third-party integrations.
Role in CD: E2E tests are monitoring, not gating. They run after deployment to verify that
the integrated system works. A small suite of smoke tests can run immediately post-deployment
to catch gross integration failures. Broader E2E suites run on a schedule.
Why E2E tests should not gate your pipeline: E2E tests are non-deterministic. They fail for
reasons unrelated to your change - network blips, third-party outages, shared environment
instability. If your pipeline depends on E2E tests passing before you can deploy, your deployment
frequency is limited by the reliability of every system in the chain. This is the opposite of the
independence CD requires.
See End-to-End Tests for detailed guidance.
How the layers work together
| Pipeline stage | Test layer | Deterministic? | Blocks deploy? |
|---|---|---|---|
| On every commit | Unit tests | Yes | Yes |
| On every commit | Integration tests | Yes | Yes |
| On every commit | Functional tests | Yes | Yes |
| Asynchronous | Contract tests | No | No (triggers review) |
| Post-deployment | E2E smoke tests | No | Triggers rollback if critical |
| Post-deployment | Synthetic monitoring | No | Triggers alerts |
The critical insight: everything that blocks deployment is deterministic and under your
control. Everything that involves external systems runs asynchronously or post-deployment. This
is what gives you the independence to deploy any time, regardless of the state of the world
around you.
Week 1 Action Plan
If your test suite is not yet ready to support CD, use this focused action plan to make immediate
progress.
Day 1-2: Audit your current test suite
Assess where you stand before making changes.
Actions:
- Run your full test suite 3 times. Note total duration and any tests that pass intermittently
(flaky tests).
- Count tests by type: unit, integration, functional, end-to-end.
- Identify tests that require external dependencies (databases, APIs, file systems) to run.
- Record your baseline: total test count, pass rate, duration, flaky test count.
- Map each test type to a pipeline stage. Which tests gate deployment? Which run asynchronously?
Which tests couple your deployment to external systems?
Output: A clear picture of your test distribution and the specific problems to address.
Day 2-3: Fix or remove flaky tests
Flaky tests are worse than no tests. They train developers to ignore failures, which means real
failures also get ignored.
Actions:
- Quarantine all flaky tests immediately. Move them to a separate suite that does not block the
build.
- For each quarantined test, decide: fix it (if the behavior it tests matters) or delete it (if
it does not).
- Common causes of flakiness: timing dependencies, shared mutable state, reliance on external
services, test order dependencies.
- Target: zero flaky tests in your main test suite by end of week.
Day 3-4: Decouple your pipeline from external dependencies
This is the highest-leverage change for CD. Identify every test that calls a real external service
and replace that dependency with a test double.
Actions:
- List every external service your tests depend on: databases, APIs, message queues, file
storage, third-party services.
- For each dependency, decide the right test double approach:
- In-memory fakes for databases (e.g., SQLite, H2, testcontainers with local instances).
- HTTP stubs for external APIs (e.g., WireMock, nock, MSW).
- Fakes for message queues, email services, and other infrastructure.
- Replace the dependencies in your unit, integration, and functional tests.
- Move the original tests that hit real services into a separate suite - these become your
starting contract tests or E2E smoke tests.
Output: A test suite where everything that blocks the build is deterministic and runs without
network access to external systems.
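For example, here is a sketch of swapping a live HTTP call for a test double using the standard library's unittest.mock; the exchange-rate service and the function names are hypothetical.

```python
from unittest.mock import patch

import requests


def fetch_fx_rate(currency: str) -> float:
    # Real implementation calls an external service we do not control.
    response = requests.get(f"https://fx.example.com/rates/{currency}", timeout=5)
    response.raise_for_status()
    return response.json()["rate"]


def price_in(currency: str, amount_usd: float) -> float:
    return round(amount_usd * fetch_fx_rate(currency), 2)


@patch(__name__ + ".fetch_fx_rate", return_value=0.92)
def test_price_in_converts_without_network(mock_rate):
    # The gating test runs with no network access; the real call moves to a
    # contract test or E2E smoke test.
    assert price_in("EUR", 100.0) == 92.0
    mock_rate.assert_called_once_with("EUR")
```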
Day 4-5: Add functional tests for critical paths
If you don’t have functional tests (component tests) that exercise your whole service in
isolation, start with the most critical paths.
Actions:
- Identify the 3-5 most critical user journeys or API endpoints in your application.
- Write a functional test for each: boot the application, stub external dependencies, send a
real request or simulate a real user action, verify the response.
- Each functional test should prove that the feature works correctly assuming external
dependencies behave as expected (which your test doubles encode).
- Run these in CI on every commit.
Day 5: Set up contract tests for your most important dependency
Pick the external dependency that changes most frequently or has caused the most production
issues. Set up a contract test for it.
Actions:
- Write a contract test that validates the response structure (types, required fields, status
codes) of the dependency’s API.
- Run it on a schedule (e.g., every hour or daily), not on every commit.
- When it fails, update your test doubles to match the new reality and re-verify your
functional tests.
- If the dependency is owned by another team in your organization, explore consumer-driven
contracts with a tool like Pact.
Test-Driven Development (TDD)
TDD is the practice of writing the test before the code. It is the most effective way to build a
reliable test suite because it ensures every piece of behavior has a corresponding test.
The TDD cycle:
- Red: Write a failing test that describes the behavior you want.
- Green: Write the minimum code to make the test pass.
- Refactor: Improve the code without changing the behavior. The test ensures you do not
break anything.
Why TDD supports CD:
- Every change is automatically covered by a test
- The test suite grows proportionally with the codebase
- Tests describe behavior, not implementation, making them more resilient to refactoring
- Developers get immediate feedback on whether their change works
TDD is not mandatory for CD, but teams that practice TDD consistently have significantly faster
and more reliable test suites.
Getting started with TDD
If your team is new to TDD, start small:
- Pick one new feature or bug fix this week.
- Write the test first, watch it fail.
- Write the code to make it pass.
- Refactor.
- Repeat for the next change.
Do not try to retroactively TDD your entire codebase. Apply TDD to new code and to any code you
modify.
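A compressed red-green sketch of what that first TDD change might look like in Python; the shopping-cart example is hypothetical.

```python
# Step 1 (red): write the test first and watch it fail against the old code,
# which crashed on an empty cart.
def test_cart_total_is_zero_for_empty_cart():
    assert cart_total([]) == 0


# Step 2 (green): write the minimum implementation that makes the test pass.
def cart_total(items):
    return sum(item["price"] for item in items)


# Step 3 (refactor): improve names, types, or structure with the test as a
# safety net, re-running the suite after each change.
```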
Testing Matrix
Use this reference to decide what type of test to write and where it runs in your pipeline.
| What You Need to Verify | Test Type | Speed | Deterministic? | Blocks Deploy? | See Also |
|---|---|---|---|---|---|
| A function or method behaves correctly | Unit | Milliseconds | Yes | Yes | Unit Tests |
| Components interact correctly at a boundary | Integration | Milliseconds to seconds | Yes | Yes | Integration Tests |
| Your whole service works in isolation | Functional | Seconds | Yes | Yes | Functional Tests |
| Your test doubles match reality | Contract | Seconds | No | No | Contract Tests |
| A critical user journey works end-to-end | E2E | Minutes | No | No | End-to-End Tests |
| Code quality, security, and style compliance | Static Analysis | Seconds | Yes | Yes | |
Best Practices Summary
Do
- Run tests on every commit. If tests do not run automatically, they will be skipped.
- Keep the deterministic suite under 10 minutes. If it is slower, developers will stop
running it locally.
- Fix broken tests immediately. A broken test is equivalent to a broken build.
- Delete tests that do not provide value. A test that never fails and tests trivial behavior
is maintenance cost with no benefit.
- Test behavior, not implementation. Tests should verify what the code does, not how it
does it. As Ham Vocke advises: “if I enter values x and y, will the result be z?” - not the
sequence of internal calls that produce z.
- Use test doubles for external dependencies. Your deterministic tests should run without
network access to external systems.
- Validate test doubles with contract tests. Test doubles that drift from reality give false
confidence.
- Treat test code as production code. Give it the same care, review, and refactoring
attention.
Do Not
- Do not tolerate flaky tests. Quarantine or delete them immediately.
- Do not gate your pipeline on non-deterministic tests. E2E and contract test failures
should trigger review or alerts, not block deployment.
- Do not couple your deployment to external system availability. If a third-party API being
down prevents you from deploying, your test architecture has a critical gap.
- Do not write tests after the fact as a checkbox exercise. Tests written without
understanding the behavior they verify add noise, not value.
- Do not test private methods directly. Test the public interface; private methods are tested
indirectly.
- Do not share mutable state between tests. Each test should set up and tear down its own
state.
- Do not use sleep/wait for timing-dependent tests. Use explicit waits, polling, or
event-driven assertions (see the polling sketch after this list).
- Do not require a running database or external service for unit tests. That makes them
integration tests - which is fine, but categorize them correctly.
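Here is one way to replace sleep-based waits with polling, as a sketch; the `wait_until` helper and the fake queue are ours, not a library API.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll a condition until it is true or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    raise AssertionError(f"condition not met within {timeout} seconds")


def test_background_job_completes():
    class FakeQueue:
        calls = 0

        def is_empty(self):
            self.calls += 1
            return self.calls > 3  # becomes true after a few polls

    queue = FakeQueue()
    wait_until(queue.is_empty)  # fails fast with a clear message on timeout
```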
Using Tests to Find and Eliminate Defect Sources
A test suite that catches bugs is good. A test suite that helps you stop producing those bugs
is transformational. Every test failure is evidence of a defect, and every defect has a source. If
you treat test failures only as things to fix, you are doing rework. If you treat them as
diagnostic data about where your process breaks down, you can make systemic changes that prevent
entire categories of defects from occurring.
This is the difference between a team that writes more tests to catch more bugs and a team that
changes how it works so that fewer bugs are created in the first place.
Trace every defect to its origin
When a test catches a defect - or worse, when a defect escapes to production - ask: where was
this defect introduced, and what would have prevented it from being created?
Defects do not originate randomly. They cluster around specific causes, and each cause has a
systemic fix:
| Where Defects Originate | Example Defects | Detection Method | Systemic Fix |
|---|---|---|---|
| Requirements | Building the right thing wrong, or the wrong thing right | UX analytics, task completion tracking, A/B testing | Acceptance criteria as user outcomes, not implementation tasks. Three Amigos sessions before work starts. Example mapping to surface edge cases before coding begins. |
| Missing domain knowledge | Business rules encoded incorrectly, implicit assumptions | Magic number detection, knowledge-concentration metrics | Embed domain rules in code using ubiquitous language (DDD). Pair programming to spread knowledge. Living documentation generated from code. |
| Integration boundaries | Interface mismatches, wrong assumptions about upstream behavior | Consumer-driven contract tests, schema validation | Contract tests mandatory per boundary. API-first design. Document behavioral contracts, not just data schemas. |
| Untested edge cases | Null handling, boundary values, error paths | Mutation testing, branch coverage thresholds, property-based testing | Require a test for every bug fix. Adopt property-based testing for logic with many input permutations. Boundary value analysis as a standard practice. |
| Unintended side effects | Change to module A breaks module B | Mutation testing, change impact analysis | Small focused commits. Trunk-based development (integrate daily so side effects surface immediately). Modular design with clear boundaries. |
| Accumulated complexity | Defects cluster in the most complex, most-changed files | Complexity trends, duplication scoring, dependency cycle detection | Refactoring as part of every story, not deferred to a “tech debt sprint.” Dedicated complexity budget. |
| Long-lived branches | Merge conflicts, integration failures, stale code | Branch age alerts, merge conflict frequency | Trunk-based development. Merge at least daily. CI rejects stale branches. |
| Configuration drift | Works in staging, fails in production | IaC drift detection, environment comparison, smoke tests | All infrastructure as code. Same provisioning for every environment. Immutable infrastructure. |
| Data assumptions | Null pointer exceptions, schema migration failures | Null safety static analysis, schema compatibility checks, migration dry-runs | Enforce null-safe types. Expand-then-contract for all schema changes. |
Build a defect feedback loop
Knowing the categories is not enough. You need a process that systematically connects test
failures to root causes and root causes to systemic fixes.
Step 1: Classify every defect. When a test fails or a bug is reported, tag it with its origin
category from the table above. This takes seconds and builds a dataset over time.
Step 2: Look for patterns. Monthly (or during retrospectives), review the defect
classifications. Which categories appear most often? That is where your process is weakest.
Step 3: Apply the systemic fix, not just the local fix. When you fix a bug, also ask: what
systemic change would prevent this entire category of bug? If most defects come from integration
boundaries, the fix is not “write more integration tests” - it is “make contract tests mandatory
for every new boundary.” If most defects come from untested edge cases, the fix is not “increase
code coverage” - it is “adopt property-based testing as a standard practice.”
Step 4: Measure whether the fix works. Track defect counts by category over time. If you
applied a systemic fix for integration boundary defects and the count does not drop, the fix is
not working and you need a different approach.
The test-for-every-bug-fix rule
One of the most effective systemic practices: every bug fix must include a test that
reproduces the bug before the fix and passes after. This is non-negotiable for CD because:
- It proves the fix actually addresses the defect (not just the symptom).
- It prevents the same defect from recurring.
- It builds test coverage exactly where the codebase is weakest - the places where bugs actually
occur.
- Over time, it shifts your test suite from “tests we thought to write” to “tests that cover
real failure modes.”
Advanced detection techniques
As your test architecture matures, add techniques that find defects humans overlook:
| Technique | What It Finds | When to Adopt |
|---|---|---|
| Mutation testing (Stryker, PIT) | Tests that pass but do not actually verify behavior - your test suite’s blind spots | When basic coverage is in place but defect escape rate is not dropping |
| Property-based testing | Edge cases and boundary conditions across large input spaces that example-based tests miss | When defects cluster around unexpected input combinations |
| Chaos engineering | Failure modes in distributed systems - what happens when a dependency is slow, returns errors, or disappears | When you have functional tests and contract tests in place and need confidence in failure handling |
| Static analysis and linting | Null safety violations, type errors, security vulnerabilities, dead code | From day one - these are cheap and fast |
For more examples of mapping defect origins to detection methods and systemic corrections, see
the CD Defect Detection and Remediation Patterns.
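Of the techniques above, property-based testing is often the easiest to try first. A minimal sketch with the Hypothesis library, using a hypothetical normalization function: instead of a few hand-picked examples, you assert a property over many generated inputs.

```python
from hypothesis import given, strategies as st


def normalize_email(value: str) -> str:
    return value.strip().lower()


@given(st.emails())
def test_normalize_email_is_idempotent(email):
    # Property: normalizing twice gives the same result as normalizing once.
    once = normalize_email(email)
    assert normalize_email(once) == once


@given(st.text())
def test_normalize_email_never_raises(value):
    # Property: normalization handles any input string without crashing.
    assert isinstance(normalize_email(value), str)
```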
Measuring Success
| Metric | Target | Why It Matters |
|---|---|---|
| Deterministic suite duration | < 10 minutes | Enables fast feedback loops |
| Flaky test count | 0 in pipeline-gating suite | Maintains trust in test results |
| External dependencies in gating tests | 0 | Ensures deployment independence |
| Test coverage trend | Increasing | Confirms new code is being tested |
| Defect escape rate | Decreasing | Confirms tests catch real bugs |
| Contract test freshness | All passing within last 24 hours | Confirms test doubles are current |
Next Step
With a reliable test suite in place, automate your build process so that building, testing, and
packaging happens with a single command. Continue to Build Automation.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0. Additional concepts
drawn from Ham Vocke,
The Practical Test Pyramid,
and Toby Clemson,
Testing Strategies in a Microservice Architecture.
3.2.3 - Build Automation
Automate your build process so a single command builds, tests, and packages your application.
Phase 1 - Foundations | Adapted from Dojo Consortium
Build automation is the mechanism that turns trunk-based development and testing into a continuous integration loop. If you cannot build, test, and package your application with a single command, you cannot automate your pipeline. This page covers the practices that make your build reproducible, fast, and trustworthy.
What Build Automation Means
Build automation is the practice of scripting every step required to go from source code to a deployable artifact. A single command - or a single CI trigger - should execute the entire sequence:
- Compile the source code (if applicable)
- Run all automated tests
- Package the application into a deployable artifact (container image, binary, archive)
- Report the result (pass or fail, with details)
No manual steps. No “run this script, then do that.” No tribal knowledge about which flags to set or which order to run things. One command, every time, same result.
The Litmus Test
Ask yourself: “Can a new team member clone the repository and produce a deployable artifact with a single command within 15 minutes?”
If the answer is no, your build is not fully automated.
Why Build Automation Matters for CD
| CD Requirement | How Build Automation Supports It |
|---|---|
| Reproducibility | The same commit always produces the same artifact, on any machine |
| Speed | Automated builds can be optimized, cached, and parallelized |
| Confidence | If the build passes, the artifact is trustworthy |
| Developer experience | Developers run the same build locally that CI runs, eliminating “works on my machine” |
| Pipeline foundation | The CI/CD pipeline is just the build running automatically on every commit |
Without build automation, every other practice in this guide breaks down. You cannot have continuous integration if the build requires manual intervention. You cannot have a deterministic pipeline if the build produces different results depending on who runs it.
Key Practices
1. Version-Controlled Build Scripts
Your build configuration lives in the same repository as your code. It is versioned, reviewed, and tested alongside the application.
What belongs in version control:
- Build scripts (Makefile, build.gradle, package.json scripts, Dockerfile)
- Dependency manifests (requirements.txt, go.mod, pom.xml, package-lock.json)
- CI/CD pipeline definitions (.github/workflows, .gitlab-ci.yml, Jenkinsfile)
- Environment setup scripts (docker-compose.yml for local development)
What does not belong in version control:
- Secrets and credentials (use secret management tools)
- Environment-specific configuration values (use environment variables or config management)
- Generated artifacts (build outputs, compiled binaries)
Anti-pattern: Build instructions that exist only in a wiki, a Confluence page, or one developer’s head. If the build steps are not in the repository, they will drift from reality.
2. Dependency Management
All dependencies must be declared explicitly and resolved deterministically.
Practices:
- Lock files: Use lock files (package-lock.json, Pipfile.lock, go.sum) to pin exact dependency versions. Check lock files into version control.
- Reproducible resolution: Running the dependency install twice should produce identical results.
- No undeclared dependencies: Your build should not rely on tools or libraries that happen to be installed on the build machine. If you need it, declare it.
- Dependency scanning: Automate vulnerability scanning of dependencies as part of the build. Do not wait for a separate security review.
Anti-pattern: “It builds on Jenkins because Jenkins has Java 11 installed, but the Dockerfile uses Java 17.” The build must declare and control its own runtime.
3. Build Caching
Fast builds keep developers in flow. Caching is the primary mechanism for build speed.
What to cache:
- Dependencies: Download once, reuse across builds. Most build tools (npm, Maven, Gradle, pip) support a local cache.
- Compilation outputs: Incremental compilation avoids rebuilding unchanged modules.
- Docker layers: Structure your Dockerfile so that rarely-changing layers (OS, dependencies) are cached and only the application code layer is rebuilt.
- Test fixtures: Prebuilt test data or container images used by tests.
Guidelines:
- Cache aggressively for local development and CI
- Invalidate caches when dependencies or build configuration change
- Do not cache test results - tests must always run
4. Single Build Script Entry Point
Developers, CI, and CD should all use the same entry point.
The CI server runs make all. A developer runs make all. The result is the same. There is no separate “CI build script” that diverges from what developers run locally.
5. Artifact Versioning
Every build artifact must be traceable to the exact commit that produced it.
Practices:
- Tag artifacts with the Git commit SHA or a build number derived from it
- Store build metadata (commit, branch, timestamp, builder) in the artifact or alongside it
- Never overwrite an existing artifact - if the version exists, the artifact is immutable
This becomes critical in Phase 2 when you establish immutable artifact practices.
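As a sketch of the idea (not a prescribed tool), a small Python helper can derive an immutable artifact tag from the current commit; it assumes git is available on the build machine, and the registry and service names are placeholders.

```python
import subprocess


def current_commit_sha(short: bool = True) -> str:
    # git rev-parse returns the commit the build was produced from.
    args = ["git", "rev-parse", "--short", "HEAD"] if short else ["git", "rev-parse", "HEAD"]
    return subprocess.run(args, capture_output=True, text=True, check=True).stdout.strip()


def artifact_tag(app_name: str) -> str:
    # e.g. "registry.example.com/checkout-service:3f2a1bc" - never reused, never overwritten.
    return f"registry.example.com/{app_name}:{current_commit_sha()}"


if __name__ == "__main__":
    print(artifact_tag("checkout-service"))
```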
CI Server Setup Basics
The CI server is the mechanism that runs your build automatically. In Phase 1, the setup is straightforward:
What the CI Server Does
- Watches the trunk for new commits
- Runs the build (the same command a developer would run locally)
- Reports the result (pass/fail, test results, build duration)
- Notifies the team if the build fails
Minimum CI Configuration
Regardless of which CI tool you use (GitHub Actions, GitLab CI, Jenkins, CircleCI), the configuration follows the same pattern: trigger on every commit to trunk, check out the code, run the single build command that developers use locally, and report the result to the team.
CI Principles for Phase 1
- Run on every commit. Not nightly, not weekly, not “when someone remembers.” Every commit to trunk triggers a build.
- Keep the build green. A failing build is the team’s top priority. Work stops until trunk is green again. (See Working Agreements.)
- Run the same build everywhere. The CI server runs the same script as local development. No CI-only steps that developers cannot reproduce.
- Fail fast. Run the fastest checks first (compilation, unit tests) before the slower ones (integration tests, packaging).
Build Time Targets
Build speed directly affects developer productivity and integration frequency. If the build takes 30 minutes, developers will not integrate multiple times per day.
| Build Phase | Target | Rationale |
|---|---|---|
| Compilation | < 1 minute | Developers need instant feedback on syntax and type errors |
| Unit tests | < 3 minutes | Fast enough to run before every commit |
| Integration tests | < 5 minutes | Must complete before the developer context-switches |
| Full build (compile + test + package) | < 10 minutes | The outer bound for fast feedback |
If Your Build Is Too Slow
Slow builds are a common constraint that blocks CD adoption. Address them systematically:
- Profile the build. Identify which steps take the most time. Optimize the bottleneck, not everything.
- Parallelize tests. Most test frameworks support parallel execution. Run independent test suites concurrently.
- Use build caching. Avoid recompiling or re-downloading unchanged dependencies.
- Split the build. Run fast checks (lint, compile, unit tests) as a “fast feedback” stage. Run slower checks (integration tests, security scans) as a second stage.
- Upgrade build hardware. Sometimes the fastest optimization is more CPU and RAM.
The target is under 10 minutes for the feedback loop that developers use on every commit. Longer-running validation (E2E tests, performance tests) can run in a separate stage.
Common Anti-Patterns
Manual Build Steps
Symptom: The build process includes steps like “open this tool and click Run” or “SSH into the build server and execute this script.”
Problem: Manual steps are error-prone, slow, and cannot be parallelized or cached. They are the single biggest obstacle to build automation.
Fix: Script every step. If a human must perform the step today, write a script that performs it tomorrow.
Environment-Specific Builds
Symptom: The build produces different artifacts for different environments (dev, staging, production). Or the build only works on specific machines because of pre-installed tools.
Problem: Environment-specific builds mean you are not testing the same artifact you deploy. Bugs that appear in production but not in staging become impossible to diagnose.
Fix: Build one artifact and configure it per environment at deployment time. The artifact is immutable; the configuration is external. (See Application Config in Phase 2.)
Build Scripts That Only Run in CI
Symptom: The CI pipeline has build steps that developers cannot run locally. Local development uses a different build process.
Problem: Developers cannot reproduce CI failures locally, leading to slow debugging cycles and “push and pray” development.
Fix: Use a single build entry point (Makefile, build script) that both CI and developers use. CI configuration should only add triggers and notifications, not build logic.
Missing Dependency Pinning
Symptom: Builds break randomly because a dependency released a new version overnight.
Problem: Without pinned dependencies, the build is non-deterministic. The same code can produce different results on different days.
Fix: Use lock files. Pin all dependency versions. Update dependencies intentionally, not accidentally.
Long Build Queues
Symptom: Developers commit to trunk, but the build does not run for 20 minutes because the CI server is processing a queue.
Problem: Delayed feedback defeats the purpose of CI. If developers do not see the result of their commit for 30 minutes, they have already moved on.
Fix: Ensure your CI infrastructure can handle your team’s commit frequency. Use parallel build agents. Prioritize builds on the main branch.
Measuring Success
| Metric | Target | Why It Matters |
|---|---|---|
| Build duration | < 10 minutes | Enables fast feedback and frequent integration |
| Build success rate | > 95% | Indicates reliable, reproducible builds |
| Time from commit to build result | < 15 minutes (including queue time) | Measures the full feedback loop |
| Developer ability to build locally | 100% of team | Confirms the build is portable and documented |
Next Step
With build automation in place, you can build, test, and package your application reliably. The next foundation is ensuring that the work you integrate daily is small enough to be safe. Continue to Work Decomposition.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
3.2.4 - Work Decomposition
Break features into small, deliverable increments that can be completed in 2 days or less.
Phase 1 - Foundations | Adapted from Dojo Consortium
Trunk-based development requires daily integration, and daily integration requires small work. If a feature takes two weeks to build, you cannot integrate it daily without decomposing it first. This page covers the techniques for breaking work into small, deliverable increments that flow through your pipeline continuously.
Why Small Work Matters for CD
Continuous delivery depends on a simple equation: small changes, integrated frequently, are safer than large changes integrated rarely.
Every practice in Phase 1 reinforces this:
- Trunk-based development requires that you integrate at least daily. You cannot integrate a two-week feature daily unless you decompose it.
- Testing fundamentals work best when each change is small enough to test thoroughly.
- Code review is fast when the change is small. A 50-line change can be reviewed in minutes. A 2,000-line change takes hours - if it gets reviewed at all.
The data supports this. The DORA research consistently shows that smaller batch sizes correlate with higher delivery performance. Small changes have:
- Lower risk: If a small change breaks something, the blast radius is limited, and the cause is obvious.
- Faster feedback: A small change gets through the pipeline quickly. You learn whether it works today, not next week.
- Easier rollback: Rolling back a 50-line change is straightforward. Rolling back a 2,000-line change often requires a new deployment.
- Better flow: Small work items move through the system predictably. Large work items block queues and create bottlenecks.
The 2-Day Rule
If a work item takes longer than 2 days to complete, it is too big.
This is not arbitrary. Two days gives you at least one integration to trunk per day (the minimum for TBD) and allows for the natural rhythm of development: plan, implement, test, integrate, move on.
When a developer says “this will take a week,” the answer is not “go faster.” The answer is “break it into smaller pieces.”
What “Complete” Means
A work item is complete when it is:
- Integrated to trunk
- All tests pass
- The change is deployable (even if the feature is not yet user-visible)
- It meets the Definition of Done
If a story requires a feature flag to hide incomplete user-facing behavior, that is fine. The code is still integrated, tested, and deployable.
Story Slicing Techniques
Story slicing is the practice of breaking user stories into the smallest possible increments that still deliver value or make progress toward delivering value.
The INVEST Criteria
Good stories follow INVEST:
| Criterion | Meaning | Why It Matters for CD |
|---|---|---|
| Independent | Can be developed and deployed without waiting for other stories | Enables parallel work and avoids blocking |
| Negotiable | Details can be discussed and adjusted | Allows the team to find the smallest valuable slice |
| Valuable | Delivers something meaningful to the user or the system | Prevents “technical stories” that do not move the product forward |
| Estimable | Small enough that the team can reasonably estimate it | Large stories are unestimable because they hide unknowns |
| Small | Completable within 2 days | Enables daily integration and fast feedback |
| Testable | Has clear acceptance criteria that can be automated | Supports the testing foundation |
Vertical Slicing
The most important slicing technique for CD is vertical slicing: cutting through all layers of the application to deliver a thin but complete slice of functionality.
Vertical slice (correct):
“As a user, I can log in with my email and password.”
This slice touches the UI (login form), the API (authentication endpoint), and the database (user lookup). It is deployable and testable end-to-end.
Horizontal slice (anti-pattern):
“Build the database schema for user accounts.”
“Build the authentication API.”
“Build the login form UI.”
Each horizontal slice is incomplete on its own. None is deployable. None is testable end-to-end. They create dependencies between work items and block flow.
Slicing Strategies
When a story feels too big, apply one of these strategies:
| Strategy | How It Works | Example |
|---|---|---|
| By workflow step | Implement one step of a multi-step process | “User can add items to cart” (before “user can checkout”) |
| By business rule | Implement one rule at a time | “Orders over $100 get free shipping” (before “orders ship to international addresses”) |
| By data variation | Handle one data type first | “Support credit card payments” (before “support PayPal”) |
| By operation | Implement CRUD operations separately | “Create a new customer” (before “edit customer” or “delete customer”) |
| By performance | Get it working first, optimize later | “Search returns results” (before “search returns results in under 200ms”) |
| By platform | Support one platform first | “Works on desktop web” (before “works on mobile”) |
| Happy path first | Implement the success case first | “User completes checkout” (before “user sees error when payment fails”) |
Example: Decomposing a Feature
Original story (too big):
“As a user, I can manage my profile including name, email, avatar, password, notification preferences, and two-factor authentication.”
Decomposed into vertical slices:
- “User can view their current profile information” (read-only display)
- “User can update their name” (simplest edit)
- “User can update their email with verification” (adds email flow)
- “User can upload an avatar image” (adds file handling)
- “User can change their password” (adds security validation)
- “User can configure notification preferences” (adds preferences)
- “User can enable two-factor authentication” (adds 2FA flow)
Each slice is independently deployable, testable, and completable within 2 days. Each delivers incremental value. The feature is built up over a series of small deliveries rather than one large batch.
Behavior-Driven Development (BDD) is not just a testing practice - it is a powerful tool for decomposing work into small, clear increments.
Three Amigos
Before work begins, hold a brief “Three Amigos” session with three perspectives:
- Business/Product: What should this feature do? What is the expected behavior?
- Development: How will we build it? What are the technical considerations?
- Testing: How will we verify it? What are the edge cases?
This 15-30 minute conversation accomplishes two things:
- Shared understanding: Everyone agrees on what “done” looks like before work begins.
- Natural decomposition: Discussing specific scenarios reveals natural slice boundaries.
Specification by Example
Write acceptance criteria as concrete examples, not abstract requirements.
Abstract (hard to slice):
“The system should validate user input.”
Concrete (easy to slice):
- Given an email field, when the user enters “not-an-email”, then the form shows “Please enter a valid email address.”
- Given a password field, when the user enters fewer than 8 characters, then the form shows “Password must be at least 8 characters.”
- Given a name field, when the user leaves it blank, then the form shows “Name is required.”
Each concrete example can become its own story or task. The scope is clear, the acceptance criteria are testable, and the work is small.
Structure acceptance criteria in Given-When-Then format to make them executable:
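For example, one of the concrete password scenarios above can become a plain executable test with the Given/When/Then structure kept visible in comments. Tools such as behave or pytest-bdd can map real Gherkin files to step functions, but the SignupForm below is just a hypothetical in-memory stand-in to keep the sketch self-contained.

```python
class SignupForm:
    """Minimal stand-in for the real form under test."""

    def __init__(self):
        self.errors = []

    def submit(self, email: str, password: str):
        if len(password) < 8:
            self.errors.append("Password must be at least 8 characters.")


def test_short_password_shows_validation_message():
    # Given a signup form with a password field
    form = SignupForm()

    # When the user enters fewer than 8 characters
    form.submit(email="dev@example.com", password="short")

    # Then the form shows the validation message
    assert "Password must be at least 8 characters." in form.errors
```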
Each scenario is a natural unit of work. Implement one scenario at a time, integrate to trunk after each one.
Task Decomposition Within Stories
Even well-sliced stories may contain multiple tasks. Decompose stories into tasks that can be completed and integrated independently.
Example story: “User can update their name”
Tasks:
- Add the name field to the profile API endpoint (backend change, integration test)
- Add the name field to the profile form (frontend change, unit test)
- Connect the form to the API endpoint (integration, E2E test)
Each task results in a commit to trunk. The story is completed through a series of small integrations, not one large merge.
Guidelines for task decomposition:
- Each task should take hours, not days
- Each task should leave trunk in a working state after integration
- Tasks should be ordered so that the simplest changes come first
- If a task requires a feature flag or stub to be integrated safely, that is fine
Common Anti-Patterns
Horizontal Slicing
Symptom: Stories are organized by architectural layer: “build the database schema,” “build the API,” “build the UI.”
Problem: No individual slice is deployable or testable end-to-end. Integration happens at the end, which is where bugs are found and schedules slip.
Fix: Slice vertically. Every story should touch all the layers needed to deliver a thin slice of complete functionality.
Technical Stories
Symptom: The backlog contains stories like “refactor the database access layer” or “upgrade to React 18” that do not deliver user-visible value.
Problem: Technical work is important, but when it is separated from feature work, it becomes hard to prioritize and easy to defer. It also creates large, risky changes.
Fix: Embed technical improvements in feature stories. Refactor as you go. If a technical change is necessary, tie it to a specific business outcome and keep it small enough to complete in 2 days.
Stories That Are Really Epics
Symptom: A story has 10+ acceptance criteria, or the estimate is “8 points” or “2 weeks.”
Problem: Large stories hide unknowns, resist estimation, and cannot be integrated daily.
Fix: If a story has more than 3-5 acceptance criteria, it is an epic. Break it into smaller stories using the slicing strategies above.
Splitting by Role Instead of by Behavior
Symptom: Separate stories for “frontend developer builds the UI” and “backend developer builds the API.”
Problem: This creates handoff dependencies and delays integration. The feature is not testable until both stories are complete.
Fix: Write stories from the user’s perspective. The same developer (or pair) implements the full vertical slice.
Deferring “Edge Cases” Indefinitely
Symptom: The team builds the happy path and creates a backlog of “handle error case X” stories that never get prioritized.
Problem: Error handling is not optional. Unhandled edge cases become production incidents.
Fix: Include the most important error cases in the initial story decomposition. Use the “happy path first” slicing strategy, but schedule edge case stories immediately after, not “someday.”
Measuring Success
| Metric | Target | Why It Matters |
|---|---|---|
| Story cycle time | < 2 days from start to trunk | Confirms stories are small enough |
| Development cycle time | Decreasing | Shows improved flow from smaller work |
| Stories completed per week | Increasing (with same team size) | Indicates better decomposition and less rework |
| Work in progress | Decreasing | Fewer large stories blocking the pipeline |
Next Step
Small, well-decomposed work flows through the system quickly - but only if code review does not become a bottleneck. Continue to Code Review to learn how to keep review fast and effective.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
3.2.5 - Code Review
Streamline code review to provide fast feedback without blocking flow.
Phase 1 - Foundations | Adapted from Dojo Consortium
Code review is essential for quality, but it is also the most common bottleneck in teams adopting trunk-based development. If reviews take days, daily integration is impossible. This page covers review techniques that maintain quality while enabling the flow that CD requires.
Why Code Review Matters for CD
Code review serves multiple purposes:
- Defect detection: A second pair of eyes catches bugs that the author missed.
- Knowledge sharing: Reviews spread understanding of the codebase across the team.
- Consistency: Reviews enforce coding standards and architectural patterns.
- Mentoring: Junior developers learn by having their code reviewed and by reviewing others’ code.
These are real benefits. The challenge is that traditional code review - open a pull request, wait for someone to review it, address comments, wait again - is too slow for CD.
In a CD workflow, code review must happen within minutes or hours, not days. The review is still rigorous, but the process is designed for speed.
The Core Tension: Quality vs. Flow
Traditional teams optimize review for thoroughness: detailed comments, multiple reviewers, extensive back-and-forth. This produces high-quality reviews but blocks flow.
CD teams optimize review for speed without sacrificing the quality that matters. The key insight is that most of the quality benefit of code review comes from small, focused reviews done quickly, not from exhaustive reviews done slowly.
| Traditional Review | CD-Compatible Review |
| --- | --- |
| Review happens after the feature is complete | Review happens continuously throughout development |
| Large diffs (hundreds or thousands of lines) | Small diffs (< 200 lines, ideally < 50) |
| Multiple rounds of feedback and revision | One round, or real-time feedback during pairing |
| Review takes 1-3 days | Review takes minutes to a few hours |
| Review is asynchronous by default | Review is synchronous by preference |
| 2+ reviewers required | 1 reviewer (or pairing as the review) |
Synchronous vs. Asynchronous Review
Synchronous Review (Preferred for CD)
In synchronous review, the reviewer and author are engaged at the same time. Feedback is immediate. Questions are answered in real time. The review is done when the conversation ends.
Methods:
- Pair programming: Two developers work on the same code at the same time. Review is continuous. There is no separate review step because the code was reviewed as it was written.
- Mob programming: The entire team (or a subset) works on the same code together. Everyone reviews in real time.
- Over-the-shoulder review: The author walks the reviewer through the change in person or on a video call. The reviewer asks questions and provides feedback immediately.
Advantages for CD:
- Zero wait time between “ready for review” and “review complete”
- Higher bandwidth communication (tone, context, visual cues) catches more issues
- Immediate resolution of questions - no async back-and-forth
- Knowledge transfer happens naturally through the shared work
Asynchronous Review (When Necessary)
Sometimes synchronous review is not possible - time zones, schedules, or team preferences may require asynchronous review. This is fine, but it must be fast.
Rules for async review in a CD workflow:
- Review within 2 hours. If a pull request sits for a day, it blocks integration. Set a team working agreement: “pull requests are reviewed within 2 hours during working hours.”
- Keep changes small. A 50-line change can be reviewed in 5 minutes. A 500-line change takes an hour and reviewers procrastinate on it.
- Use draft PRs for early feedback. If you want feedback on an approach before the code is complete, open a draft PR. Do not wait until the change is “perfect.”
- Avoid back-and-forth. If a comment requires discussion, move to a synchronous channel (call, chat). Async comment threads that go 5 rounds deep are a sign the change is too large or the design was not discussed upfront.
Review Techniques Compatible with TBD
Pair Programming as Review
When two developers pair on a change, the code is reviewed as it is written. There is no separate review step, no pull request waiting for approval, and no delay to integration.
How it works with TBD:
- Two developers sit together (physically or via screen share)
- They discuss the approach, write the code, and review each other’s decisions in real time
- When the change is ready, they commit to trunk together
- Both developers are accountable for the quality of the code
When to pair:
- New or unfamiliar areas of the codebase
- Changes that affect critical paths
- When a junior developer is working on a change (pairing doubles as mentoring)
- Any time the change involves design decisions that benefit from discussion
Pair programming satisfies most organizations’ code review requirements because two developers have actively reviewed and approved the code.
Mob Programming as Review
Mob programming extends pairing to the whole team. One person drives (types), one person navigates (directs), and the rest observe and contribute.
When to mob:
- Establishing new patterns or architectural decisions
- Complex changes that benefit from multiple perspectives
- Onboarding new team members to the codebase
- Working through particularly difficult problems
Mob programming is intensive but highly effective. Every team member understands the code, the design decisions, and the trade-offs.
Rapid Async Review
For teams that use pull requests, rapid async review adapts the pull request workflow for CD speed.
Practices:
- Auto-assign reviewers. Do not wait for someone to volunteer. Use tools to automatically assign a reviewer when a PR is opened.
- Keep PRs small. Target < 200 lines of changed code. Smaller PRs get reviewed faster and more thoroughly.
- Provide context. Write a clear PR description that explains what the change does, why it is needed, and how to verify it. A good description reduces review time dramatically.
- Use automated checks. Run linting, formatting, and tests before the human review. The reviewer should focus on logic and design, not style.
- Approve and merge quickly. If the change looks correct, approve it. Do not hold it for nitpicks. Nitpicks can be addressed in a follow-up commit.
What to Review
Not everything in a code change deserves the same level of scrutiny. Focus reviewer attention where it matters most.
High Priority (Reviewer Should Focus Here)
- Behavior correctness: Does the code do what it is supposed to do? Are edge cases handled?
- Security: Does the change introduce vulnerabilities? Are inputs validated? Are secrets handled properly?
- Clarity: Can another developer understand this code in 6 months? Are names clear? Is the logic straightforward?
- Test coverage: Are the new behaviors tested? Do the tests verify the right things?
- API contracts: Do changes to public interfaces maintain backward compatibility? Are they documented?
- Error handling: What happens when things go wrong? Are errors caught, logged, and surfaced appropriately?
Low Priority (Automate Instead of Reviewing)
- Code style and formatting: Use automated formatters (Prettier, Black, gofmt). Do not waste reviewer time on indentation and bracket placement.
- Import ordering: Automate with linting rules.
- Naming conventions: Enforce with lint rules where possible. Only flag naming in review if it genuinely harms readability.
- Unused variables or imports: Static analysis tools catch these instantly.
- Consistent patterns: Where possible, encode patterns in architecture decision records and lint rules rather than relying on reviewers to catch deviations.
Rule of thumb: If a style or convention issue can be caught by a machine, do not ask a human to catch it. Reserve human attention for the things machines cannot evaluate: correctness, design, clarity, and security.
Review Scope for Small Changes
In a CD workflow, most changes are small - tens of lines, not hundreds. This changes the economics of review.
| Change Size | Expected Review Time | Review Depth |
| --- | --- | --- |
| < 20 lines | 2-5 minutes | Quick scan: is it correct? Any security issues? |
| 20-100 lines | 5-15 minutes | Full review: behavior, tests, clarity |
| 100-200 lines | 15-30 minutes | Detailed review: design, contracts, edge cases |
| > 200 lines | Consider splitting the change | Large changes get superficial reviews |
Research consistently shows that reviewer effectiveness drops sharply after 200-400 lines. If you are regularly reviewing changes larger than 200 lines, the problem is not the review process - it is the work decomposition.
Working Agreements for Review SLAs
Establish clear team agreements about review expectations. Without explicit agreements, review latency will drift based on individual habits.
Recommended Review Agreements
| Agreement | Target |
| --- | --- |
| Response time | Review within 2 hours during working hours |
| Reviewer count | 1 reviewer (or pairing as the review) |
| PR size | < 200 lines of changed code |
| Blocking issues only | Only block a merge for correctness, security, or significant design issues |
| Nitpicks | Use a “nit:” prefix. Nitpicks are suggestions, not merge blockers |
| Stale PRs | PRs open for > 24 hours are escalated to the team |
| Self-review | Author reviews their own diff before requesting review |
How to Enforce Review SLAs
- Track review turnaround time. If it consistently exceeds 2 hours, discuss it in retrospectives.
- Make review a first-class responsibility, not something developers do “when they have time.”
- If a reviewer is unavailable, any other team member can review. Do not create single-reviewer dependencies.
- Consider pairing as the default and async review as the exception. This eliminates the review bottleneck entirely.
Code Review and Trunk-Based Development
Code review and TBD work together, but only if review does not block integration. Here is how to reconcile them:
| TBD Requirement | How Review Adapts |
| --- | --- |
| Integrate to trunk at least daily | Reviews must complete within hours, not days |
| Branches live < 24 hours | PRs are opened and merged within the same day |
| Trunk is always releasable | Reviewers focus on correctness, not perfection |
| Small, frequent changes | Small changes are reviewed quickly and thoroughly |
If your team finds that review is the bottleneck preventing daily integration, the most effective solution is to adopt pair programming. It eliminates the review step entirely by making review continuous.
Measuring Success
| Metric | Target | Why It Matters |
| --- | --- | --- |
| Review turnaround time | < 2 hours | Prevents review from blocking integration |
| PR size (lines changed) | < 200 lines | Smaller PRs get faster, more thorough reviews |
| PR age at merge | < 24 hours | Aligns with TBD branch age constraint |
| Review rework cycles | < 2 rounds | Multiple rounds indicate the change is too large or the design was not discussed upfront |
Next Step
Code review practices need to be codified in team agreements alongside other shared commitments. Continue to Working Agreements to establish your team’s definitions of done, ready, and CI practice.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
3.2.6 - Working Agreements
Establish shared definitions of done and ready to align the team on quality and process.
Phase 1 - Foundations | Adapted from Dojo Consortium
The practices in Phase 1 - trunk-based development, testing, small work, and fast review - only work when the whole team commits to them. Working agreements make that commitment explicit. This page covers the key agreements a team needs before moving to pipeline automation in Phase 2.
Why Working Agreements Matter
A working agreement is a shared commitment that the team creates, owns, and enforces together. It is not a policy imposed from outside. It is the team’s own answer to the question: “How do we work together?”
Without working agreements, CD practices drift. One developer integrates daily; another keeps a branch for a week. One developer fixes a broken build immediately; another waits until after lunch. These inconsistencies compound. Within weeks, the team is no longer practicing CD - they are practicing individual preferences.
Working agreements prevent this drift by making expectations explicit. When everyone agrees on what “done” means, what “ready” means, and how CI works, the team can hold each other accountable without conflict.
Definition of Done
The Definition of Done (DoD) is the team’s shared standard for when a work item is complete. For CD, the Definition of Done must include deployment.
Minimum Definition of Done for CD
A work item is done when all of the following are true:
- The code is integrated to trunk
- All automated tests pass
- The change has been reviewed (or was pair programmed)
- The change is deployable to production
Why “Deployed to Production” Matters
Many teams define “done” as “code is merged.” This creates a gap between “done” and “delivered.” Work accumulates in a staging environment, waiting for a release. Risk grows with each unreleased change.
In a CD organization, “done” means the change is in production (or ready to be deployed to production at any time). This is the ultimate test of completeness: the change works in the real environment, with real data, under real load.
In Phase 1, you may not yet have the pipeline to deploy every change to production automatically. That is fine - your DoD should still include “deployable to production” as the standard, even if the deployment step is not yet automated. The pipeline work in Phase 2 will close that gap.
Extending Your Definition of Done
As your CD maturity grows, extend the DoD:
| Phase | Addition to DoD |
| --- | --- |
| Phase 1 (Foundations) | Code integrated to trunk, tests pass, reviewed, deployable |
| Phase 2 (Pipeline) | Artifact built and validated by the pipeline |
| Phase 3 (Optimize) | Change deployed to production behind a feature flag |
| Phase 4 (Deliver on Demand) | Change deployed to production and monitored |
Definition of Ready
The Definition of Ready (DoR) answers: “When is a work item ready to be worked on?” Pulling unready work into development creates waste - unclear requirements lead to rework, missing acceptance criteria lead to untestable changes, and oversized stories lead to long-lived branches.
Minimum Definition of Ready for CD
A work item is ready when all of the following are true:
- Acceptance criteria are defined and testable
- The item is small enough to complete in less than two days
- The team has discussed the item and agrees on what it means
Common Mistakes with Definition of Ready
- Making it too rigid. The DoR is a guideline, not a gate. If the team agrees a work item is understood well enough, it is ready. Do not use the DoR to avoid starting work.
- Requiring design documents. For small work items (< 2 days), a conversation and acceptance criteria are sufficient. Formal design documents are for larger initiatives.
- Skipping the conversation. The DoR is most valuable as a prompt for discussion, not as a checklist. The Three Amigos conversation matters more than the checkboxes.
CI Working Agreement
The CI working agreement codifies how the team practices continuous integration. This is the most operationally critical working agreement for CD.
The CI Agreement
The team agrees to the following practices:
Integration:
- Every developer integrates work to trunk at least once per day
Build:
- Every integration triggers the automated build and test suite
Broken builds:
- A broken build is the team’s top priority
- If the build cannot be fixed within 10 minutes, the offending commit is reverted
Work in progress:
- Finish and integrate current work before starting new work
Why “Broken Build = Top Priority”
This is the single most important CI agreement. When the build is broken:
- No one can integrate safely. Changes are stacking up.
- Trunk is not releasable. The team has lost its safety net.
- Every minute the build stays broken, the team accumulates risk.
“Fix the build” is not a suggestion. It is an agreement that the team enforces collectively. If the build is broken and someone starts a new feature instead of fixing it, the team should call that out. This is not punitive - it is the team protecting its own ability to deliver.
The Revert Rule
If a broken build cannot be fixed within 10 minutes, revert the offending commit and fix the issue on a branch. This keeps trunk green and unblocks the rest of the team. The developer who made the change is not being punished - they are protecting the team’s flow.
Reverting feels uncomfortable at first. Teams worry about “losing work.” But a reverted commit is not lost - the code is still in the Git history. The developer can re-apply their change after fixing the issue. The alternative - a broken trunk for hours while someone debugs - is far more costly.
How Working Agreements Support the CD Migration
Each working agreement maps directly to a Phase 1 practice: the Definition of Done supports a releasable trunk, the Definition of Ready supports small, well-understood work, and the CI agreement supports daily integration and fast recovery from broken builds.
Without these agreements, individual practices exist in isolation. Working agreements connect them into a coherent way of working.
Template: Create Your Own Working Agreements
Use this template as a starting point. Customize it for your team’s context. The specific targets may differ, but the structure should remain.
Team Working Agreement Template
- Definition of Done: A work item is done when…
- Definition of Ready: A work item is ready when…
- CI agreement: We integrate to trunk at least ___ times per day. A broken build is fixed or reverted within ___ minutes.
- Review agreement: Pull requests are reviewed within ___ hours and stay under ___ lines.
Tips for Creating Working Agreements
- Include everyone. Every team member should participate in creating the agreement. Agreements imposed by a manager or tech lead are policies, not agreements.
- Start simple. Do not try to cover every scenario. Start with the essentials (DoD, DoR, CI) and add specifics as the team identifies gaps.
- Make them visible. Post the agreements where the team sees them daily - on a team wiki, in the team channel, or on a physical board.
- Review regularly. Agreements should evolve as the team matures. Review them monthly. Remove agreements that are second nature. Add agreements for new challenges.
- Enforce collectively. Working agreements are only effective if the team holds each other accountable. This is a team responsibility, not a manager responsibility.
- Start with agreements you can keep. If the team is currently integrating once a week, do not agree to integrate three times daily. Agree to integrate daily, practice for a month, then tighten.
Measuring Success
| Metric | Target | Why It Matters |
| --- | --- | --- |
| Agreement adherence | Team self-reports > 80% adherence | Indicates agreements are realistic and followed |
| Agreement review frequency | Monthly | Ensures agreements stay relevant |
| Integration frequency | Meets CI agreement target | Validates the CI working agreement |
| Broken build fix time | Meets CI agreement target | Validates the broken build response agreement |
Next Step
With working agreements in place, your team has established the foundations for continuous delivery: daily integration, reliable testing, automated builds, small work, fast review, and shared commitments.
You are ready to move to Phase 2: Pipeline, where you will build the automated path from commit to production.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
3.2.7 - Everything as Code
Every artifact that defines your system - infrastructure, pipelines, configuration, database schemas, monitoring - belongs in version control and is delivered through pipelines.
Phase 1 - Foundations
If it is not in version control, it does not exist. If it is not delivered through a pipeline, it
is a manual step. Manual steps block continuous delivery. This page establishes the principle that
everything required to build, deploy, and operate your system is defined as code, version
controlled, reviewed, and delivered through the same automated pipelines as your application.
The Principle
Continuous delivery requires that any change to your system - application code, infrastructure,
pipeline configuration, database schema, monitoring rules, security policies - can be made through
a single, consistent process: change the code, commit, let the pipeline deliver it.
When something is defined as code:
- It is version controlled. You can see who changed what, when, and why. You can revert any
change. You can trace any production state to a specific commit.
- It is reviewed. Changes go through the same review process as application code. A second
pair of eyes catches mistakes before they reach production.
- It is tested. Automated validation catches errors before deployment. Linting, dry-runs,
and policy checks apply to infrastructure the same way unit tests apply to application code.
- It is reproducible. You can recreate any environment from scratch. Disaster recovery is
“re-run the pipeline,” not “find the person who knows how to configure the server.”
- It is delivered through a pipeline. No SSH, no clicking through UIs, no manual steps. The
pipeline is the only path to production for everything, not just application code.
When something is not defined as code, it is a liability. It cannot be reviewed, tested, or
reproduced. It exists only in someone’s head, a wiki page that is already outdated, or a
configuration that was applied manually and has drifted from any documented state.
What “Everything” Means
Application code
This is where most teams start, and it is the least controversial. Your application source code
is in version control, built and tested by a pipeline, and deployed as an immutable artifact.
If your application code is not in version control, start here. Nothing else in this page matters
until this is in place.
Infrastructure
Every server, network, database instance, load balancer, DNS record, and cloud resource should be
defined in code and provisioned through automation.
What this looks like:
- Cloud resources defined in Terraform, Pulumi, CloudFormation, or similar tools
- Server configuration managed by Ansible, Chef, Puppet, or container images
- Network topology, firewall rules, and security groups defined declaratively
- Environment creation is a pipeline run, not a ticket to another team
What this replaces:
- Clicking through cloud provider consoles to create resources
- SSH-ing into servers to install packages or change configuration
- Filing tickets for another team to provision an environment
- “Snowflake” servers that were configured by hand and nobody knows how to recreate
Why it matters for CD: If creating or modifying an environment requires manual steps, your
deployment frequency is limited by the availability and speed of the person who performs those
steps. If a production server fails and you cannot recreate it from code, your mean time to
recovery is measured in hours or days instead of minutes.
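To make this concrete, here is a minimal sketch of a cloud resource defined in code using Pulumi’s TypeScript SDK (one of the tools mentioned above); the resource type, names, and tags are illustrative assumptions, not a prescription.

```typescript
import * as aws from "@pulumi/aws";

// An S3 bucket defined declaratively: creating or changing it is a code review
// and a pipeline run, not a console click.
const artifactBucket = new aws.s3.Bucket("artifact-bucket", {
  acl: "private",
  tags: { team: "delivery", managedBy: "pulumi" },
});

// Exported so other stacks or pipeline steps can reference the real resource name.
export const bucketName = artifactBucket.id;
```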
Pipeline definitions
Your CI/CD pipeline configuration belongs in the same repository as the code it builds and
deploys. The pipeline is code, not a configuration applied through a UI.
What this looks like:
- Pipeline definitions in .github/workflows/, .gitlab-ci.yml, Jenkinsfile, or equivalent
- Pipeline changes go through the same review process as application code
- Pipeline behavior is deterministic - the same commit always produces the same pipeline behavior
- Teams can modify their own pipelines without filing tickets
What this replaces:
- Pipeline configuration maintained through a Jenkins UI that nobody is allowed to touch
- A “platform team” that owns all pipeline definitions and queues change requests
- Pipeline behavior that varies depending on server state or installed plugins
Why it matters for CD: The pipeline is the path to production. If the pipeline itself cannot
be changed through a reviewed, automated process, it becomes a bottleneck and a risk. Pipeline
changes should flow with the same speed and safety as application changes.
Database schemas and migrations
Database schema changes should be defined as versioned migration scripts, stored in version
control, and applied through the pipeline.
What this looks like:
- Migration scripts in the repository (using tools like Flyway, Liquibase, Alembic, or
ActiveRecord migrations)
- Every schema change is a numbered, ordered migration that can be applied and rolled back
- Migrations run as part of the deployment pipeline, not as a manual step
- Schema changes follow the expand-then-contract pattern: add the new column, deploy code that
uses it, then remove the old column in a later migration
What this replaces:
- A DBA manually applying SQL scripts during a maintenance window
- Schema changes that are “just done in production” and not tracked anywhere
- Database state that has drifted from what is defined in any migration script
Why it matters for CD: Database changes are one of the most common reasons teams cannot deploy
continuously. If schema changes require manual intervention, coordinated downtime, or a separate
approval process, they become a bottleneck that forces batching. Treating schemas as code with
automated migrations removes this bottleneck.
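As an illustration, here is a minimal sketch of a versioned, reversible migration written for a Knex-style TypeScript migration tool (an assumption for the example; Flyway, Liquibase, Alembic, and ActiveRecord follow the same idea). It shows only the expand step of expand-then-contract, and the table and column names are hypothetical.

```typescript
import type { Knex } from "knex";

// Expand step: add the new column without touching the old one.
// Code that uses the column ships next; the contract step (dropping the
// old column) is a later, separate migration.
export async function up(knex: Knex): Promise<void> {
  await knex.schema.alterTable("users", (table) => {
    table.string("display_name").nullable();
  });
}

// Every migration is reversible, so rollback is also a pipeline operation.
export async function down(knex: Knex): Promise<void> {
  await knex.schema.alterTable("users", (table) => {
    table.dropColumn("display_name");
  });
}
```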
Application configuration
Environment-specific configuration - database connection strings, API endpoints, feature flag
states, logging levels - should be defined as code and managed through version control.
What this looks like:
- Configuration values stored in a config management system (Consul, AWS Parameter Store,
environment variable definitions in infrastructure code)
- Configuration changes are committed, reviewed, and deployed through a pipeline
- The same application artifact is deployed to every environment; only the configuration differs
What this replaces:
- Configuration files edited manually on servers
- Environment variables set by hand and forgotten
- Configuration that exists only in a deployment runbook
See Application Config for detailed guidance on
externalizing configuration.
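As a sketch of what “only the configuration differs” can look like in application code, the following TypeScript reads environment-specific values at startup and fails fast when a required one is missing; the variable names are illustrative assumptions.

```typescript
// The artifact never embeds environment-specific values; it reads them at startup.
// The pipeline (or the platform) injects different values per environment.
interface AppConfig {
  databaseUrl: string;
  apiBaseUrl: string;
  logLevel: string;
}

function loadConfig(): AppConfig {
  const required = (name: string): string => {
    const value = process.env[name];
    if (!value) throw new Error(`Missing required config: ${name}`);
    return value;
  };
  return {
    databaseUrl: required("DATABASE_URL"),
    apiBaseUrl: required("API_BASE_URL"),
    logLevel: process.env.LOG_LEVEL ?? "info",
  };
}

export const config = loadConfig();
```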
Monitoring, alerting, and observability
Dashboards, alert rules, SLO definitions, and logging configuration should be defined as code.
What this looks like:
- Alert rules defined in Terraform, Prometheus rules files, or Datadog monitors-as-code
- Dashboards defined as JSON or YAML, not built by hand in a UI
- SLO definitions tracked in version control alongside the services they measure
- Logging configuration (what to log, where to send it, retention policies) in code
What this replaces:
- Dashboards built manually in a monitoring UI that nobody knows how to recreate
- Alert rules that were configured by hand during an incident and never documented
- Monitoring configuration that exists only on the monitoring server
Why it matters for CD: If you deploy ten times a day, you need to know instantly whether each
deployment is healthy. If your monitoring and alerting configuration is manual, it will drift,
break, or be incomplete. Monitoring-as-code ensures that every service has consistent, reviewed,
reproducible observability.
Security policies
Security controls - access policies, network rules, secret rotation schedules, compliance
checks - should be defined as code and enforced automatically.
What this looks like:
- IAM policies and RBAC rules defined in Terraform or policy-as-code tools (OPA, Sentinel)
- Security scanning integrated into the pipeline (SAST, dependency scanning, container image
scanning)
- Secret rotation automated and defined in code
- Compliance checks that run on every commit, not once a quarter
What this replaces:
- Security reviews that happen at the end of the development cycle
- Access policies configured through UIs and never audited
- Compliance as a manual checklist performed before each release
Why it matters for CD: Security and compliance requirements are the most common organizational
blockers for CD. When security controls are defined as code and enforced by the pipeline, you can
prove to auditors that every change passed security checks automatically. This is stronger
evidence than a manual review, and it does not slow down delivery.
The “One Change, One Process” Test
For every type of artifact in your system, ask:
If I need to change this, do I commit a code change and let the pipeline deliver it?
If the answer is yes, the artifact is managed as code. If the answer involves SSH, a UI, a
ticket to another team, or a manual step, it is not.
| Artifact | Managed as code? | If not, the risk is… |
| --- | --- | --- |
| Application source code | Usually yes | - |
| Infrastructure (servers, networks, cloud resources) | Often no | Snowflake environments, slow provisioning, unreproducible disasters |
| Pipeline definitions | Sometimes | Pipeline changes are slow, unreviewed, and risky |
| Database schemas | Sometimes | Schema changes require manual coordination and downtime |
| Application configuration | Sometimes | Config drift between environments, “works in staging” failures |
| Monitoring and alerting | Rarely | Monitoring gaps, unreproducible dashboards, alert fatigue |
| Security policies | Rarely | Security as a gate instead of a guardrail, audit failures |
The goal is for every row in this table to be “yes.” You will not get there overnight, but every
artifact you move from manual to code-managed removes a bottleneck and a risk.
How to Get There
Start with what blocks you most
Do not try to move everything to code at once. Identify the artifact type that causes the most
pain or blocks deployments most frequently:
- If environment provisioning takes days, start with infrastructure as code.
- If database changes are the reason you cannot deploy more than once a week, start with
schema migrations as code.
- If pipeline changes require tickets to a platform team, start with pipeline as code.
- If configuration drift causes production incidents, start with configuration as code.
Apply the same practices as application code
Once an artifact is defined as code, treat it with the same rigor as application code:
- Store it in version control (ideally in the same repository as the application it supports)
- Review changes before they are applied
- Test changes automatically (linting, dry-runs, policy checks)
- Deliver changes through a pipeline
- Never modify the artifact outside of this process
Eliminate manual pathways
The hardest part is closing the manual back doors. As long as someone can SSH into a server and
make a change, or click through a UI to modify infrastructure, the code-defined state will drift
from reality.
The principle is the same as Single Path to Production
for application code: the pipeline is the only way any change reaches production. This applies to
infrastructure, configuration, schemas, monitoring, and policies just as much as it applies to
application code.
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Artifact types managed as code | Track how many of the categories above are fully code-managed. The number should increase over time. |
| Manual changes to production | Count any change made outside of a pipeline (SSH, UI clicks, manual scripts). Target: zero. |
| Environment recreation time | How long does it take to recreate a production-like environment from scratch? Should decrease as more infrastructure moves to code. |
| Mean time to recovery | When infrastructure-as-code is in place, recovery from failures is “re-run the pipeline.” MTTR drops dramatically. |
3.3 - Phase 2: Pipeline
Build the automated path from commit to production: a single, deterministic pipeline that deploys immutable artifacts.
Key question: “Can we deploy any commit automatically?”
This phase creates the delivery pipeline - the automated path that takes every commit
through build, test, and deployment stages. When done right, the pipeline is the only
way changes reach production.
What You’ll Do
- Establish a single path to production - One pipeline for all changes
- Make the pipeline deterministic - Same inputs always produce same outputs
- Define “deployable” - Clear criteria for what’s ready to ship
- Use immutable artifacts - Build once, deploy everywhere
- Externalize application config - Separate config from code
- Use production-like environments - Test in environments that match production
- Design your pipeline architecture - Efficient quality gates for your context
- Enable rollback - Fast recovery from any deployment
Why This Phase Matters
The pipeline is the backbone of continuous delivery. It replaces manual handoffs with
automated quality gates, ensures every change goes through the same validation process,
and makes deployment a routine, low-risk event.
When You’re Ready to Move On
You’re ready for Phase 3: Optimize when:
- Every change reaches production through the same automated pipeline
- The pipeline produces the same result for the same inputs
- You can deploy any green build to production with confidence
- Rollback takes minutes, not hours
3.3.1 - Single Path to Production
All changes reach production through the same automated pipeline - no exceptions.
Phase 2 - Pipeline | Adapted from MinimumCD.org
Definition
A single path to production means that every change - whether it is a feature, a bug fix,
a configuration update, or an infrastructure change - follows the same automated pipeline
to reach production. There is exactly one route from a developer’s commit to a running
production system. No side doors. No emergency shortcuts. No “just this once” manual
deployments.
This is the most fundamental constraint of a continuous delivery pipeline. If you allow
multiple paths, you cannot reason about the state of production. You lose the ability to
guarantee that every change has been validated, and you undermine every other practice in
this phase.
Why It Matters for CD Migration
Teams migrating to continuous delivery often carry legacy deployment processes - a manual
runbook for “emergency” fixes, a separate path for database changes, or a distinct
workflow for infrastructure updates. Each additional path is a source of unvalidated risk.
Establishing a single path to production is the first pipeline practice because every
subsequent practice depends on it. A deterministic pipeline
only works if all changes flow through it. Immutable artifacts
are only trustworthy if no other mechanism can alter what reaches production. Your
deployable definition is meaningless if changes can bypass
the gates.
Key Principles
One pipeline for all changes
Every type of change uses the same pipeline:
- Application code - features, fixes, refactors
- Infrastructure as Code - Terraform, CloudFormation, Pulumi, Ansible
- Pipeline definitions - the pipeline itself is versioned and deployed through the pipeline
- Configuration changes - environment variables, feature flags, routing rules
- Database migrations - schema changes, data migrations
Same pipeline for all environments
The pipeline that deploys to development is the same pipeline that deploys to staging and
production. The only difference between environments is the configuration injected at
deployment time. If your staging deployment uses a different mechanism than your production
deployment, you are not testing the deployment process itself.
No manual deployments
If a human can bypass the pipeline and push a change directly to production, the single
path is broken. This includes:
- SSH access to production servers for ad-hoc changes
- Direct container image pushes outside the pipeline
- Console-based configuration changes that are not captured in version control
- “Break glass” procedures that skip validation stages
Anti-Patterns
Integration branches and multi-branch deployment paths
Using separate branches (such as develop, release, hotfix) that each have their own
deployment workflow creates multiple paths. GitFlow is a common source of this anti-pattern.
When a hotfix branch deploys through a different pipeline than the develop branch, you
cannot be confident that the hotfix has undergone the same validation.
Environment-specific pipelines
Building a separate pipeline for staging versus production - or worse, manually deploying
to staging and only using automation for production - means you are not testing your
deployment process in lower environments.
“Emergency” manual deployments
The most dangerous anti-pattern is the manual deployment reserved for emergencies. Under
pressure, teams bypass the pipeline “just this once,” introducing an unvalidated change
into production. The fix for this is not to allow exceptions - it is to make the pipeline
fast enough that it is always the fastest path to production.
Separate pipelines for different change types
Having one pipeline for application code, another for infrastructure, and yet another for
database changes means that coordinated changes across these layers are never validated
together.
Good Patterns
Feature flags
Use feature flags to decouple deployment from release. Code can be merged and deployed
through the pipeline while the feature remains hidden behind a flag. This eliminates the
need for long-lived branches and separate deployment paths for “not-ready” features.
Branch by abstraction
For large-scale refactors or technology migrations, use branch by abstraction to make
incremental changes that can be deployed through the standard pipeline at every step.
Create an abstraction layer, build the new implementation behind it, switch over
incrementally, and remove the old implementation - all through the same pipeline.
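Here is a minimal TypeScript sketch of the abstraction seam; the gateway interface, the two implementations, and the switching mechanism are hypothetical and deliberately simplified.

```typescript
// The abstraction seam: callers depend on the interface, not on either implementation.
interface PaymentGateway {
  charge(amountCents: number, customerId: string): Promise<string>;
}

class LegacyGateway implements PaymentGateway {
  async charge(amountCents: number, customerId: string): Promise<string> {
    // The existing implementation stays in place while the new one is built.
    return "legacy-receipt";
  }
}

class NewGateway implements PaymentGateway {
  async charge(amountCents: number, customerId: string): Promise<string> {
    // The new implementation grows behind the seam and ships dark on every commit.
    return "new-receipt";
  }
}

// Switch-over is incremental and reversible; when complete, delete LegacyGateway.
export function gateway(): PaymentGateway {
  return process.env.USE_NEW_GATEWAY === "true" ? new NewGateway() : new LegacyGateway();
}
```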
Dark launching
Deploy new functionality to production without exposing it to users. The code runs in
production, processes real data, and generates real metrics - but its output is not shown
to users. This validates the change under production conditions while managing risk.
Connect tests last
When building a new integration, start by deploying the code without connecting it to the
live dependency. Validate the deployment, the configuration, and the basic behavior first.
Connect to the real dependency as the final step. This keeps the change deployable through
the pipeline at every stage of development.
How to Get Started
Step 1: Map your current deployment paths
Document every way that changes currently reach production. Include manual processes,
scripts, CI/CD pipelines, direct deployments, and any emergency procedures. You will
likely find more paths than you expected.
Step 2: Identify the primary path
Choose or build one pipeline that will become the single path. This pipeline should be
the most automated and well-tested path you have. All other paths will converge into it.
Step 3: Eliminate the easiest alternate paths first
Start by removing the deployment paths that are used least frequently or are easiest to
replace. For each path you eliminate, migrate its changes into the primary pipeline.
Step 4: Make the pipeline fast enough for emergencies
The most common reason teams maintain manual deployment shortcuts is that the pipeline is
too slow for urgent fixes. If your pipeline takes 45 minutes and an incident requires a
fix in 10, the team will bypass the pipeline. Invest in pipeline speed so that the
automated path is always the fastest option.
Step 5: Remove break-glass access
Once the pipeline is fast and reliable, remove the ability to deploy outside of it.
Revoke direct production access. Disable manual deployment scripts. Make the pipeline the
only way.
Connection to the Pipeline Phase
Single path to production is the foundation of Phase 2. Without it, every other pipeline
practice is compromised:
- Deterministic pipeline requires all changes to flow through it to provide guarantees
- Deployable definition must be enforced by a single set of gates
- Immutable artifacts are only trustworthy when produced by a known, consistent process
- Rollback relies on the pipeline to deploy the previous version through the same path
Establishing this practice first creates the constraint that makes the rest of the
pipeline meaningful.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
3.3.2 - Deterministic Pipeline
The same inputs to the pipeline always produce the same outputs.
Phase 2 - Pipeline | Adapted from MinimumCD.org
Definition
A deterministic pipeline produces consistent, repeatable results. Given the same commit,
the same environment definition, and the same configuration, the pipeline will build the
same artifact, run the same tests, and produce the same outcome - every time. There is no
variance introduced by uncontrolled dependencies, environmental drift, manual
intervention, or non-deterministic test behavior.
Determinism is what transforms a pipeline from “a script that usually works” into a
reliable delivery system. When the pipeline is deterministic, a green build means
something. A failed build points to a real problem. Teams can trust the signal.
Why It Matters for CD Migration
Non-deterministic pipelines are the single largest source of wasted time in delivery
organizations. When builds fail randomly, teams learn to ignore failures. When the same
commit passes on retry, teams stop investigating root causes. When different environments
produce different results, teams lose confidence in pre-production validation.
During a CD migration, teams are building trust in automation. Every flaky test, every
“works on my machine” failure, and every environment-specific inconsistency erodes that
trust. A deterministic pipeline is what earns the team’s confidence that automation can
replace manual verification.
Key Principles
Version control everything
Every input to the pipeline must be version controlled:
- Application source code - the obvious one
- Infrastructure as Code - the environment definitions themselves
- Pipeline definitions - the CI/CD configuration files
- Test data and fixtures - the data used by automated tests
- Dependency lockfiles - exact versions of every dependency (e.g., package-lock.json, Pipfile.lock, go.sum)
- Tool versions - the versions of compilers, runtimes, linters, and build tools
If an input to the pipeline is not version controlled, it can change without notice, and
the pipeline is no longer deterministic.
Lock dependency versions
Floating dependency versions (version ranges, “latest” tags) are a common source of
non-determinism. A build that worked yesterday can break today because a transitive
dependency released a new version overnight.
Use lockfiles to pin exact versions of every dependency. Commit lockfiles to version
control. Update dependencies intentionally through pull requests, not implicitly through
builds.
Eliminate environmental variance
The pipeline should run in a controlled, reproducible environment. Containerize build
steps so that the build environment is defined in code and does not drift over time. Use
the same base images in CI as in production. Pin tool versions explicitly rather than
relying on whatever is installed on the build agent.
Remove human intervention
Any manual step in the pipeline is a source of variance. A human choosing which tests to
run, deciding whether to skip a stage, or manually approving a step introduces
non-determinism. The pipeline should run from commit to deployment without human
decisions.
This does not mean humans have no role - it means the pipeline’s behavior is fully
determined by its inputs, not by who is watching it run.
Fix flaky tests immediately
A flaky test is a test that sometimes passes and sometimes fails for the same code. Flaky
tests are the most insidious form of non-determinism because they train teams to distrust
the test suite.
When a flaky test is detected, the response must be immediate:
- Quarantine the test - remove it from the pipeline so it does not block other changes
- Fix it or delete it - flaky tests provide negative value; they are worse than no test
- Investigate the root cause - flakiness often indicates a real problem (race conditions, shared state, time dependencies, external service reliance)
Never allow a culture of “just re-run it” to take hold. Every re-run masks a real problem.
Anti-Patterns
Unpinned dependencies
Using version ranges like ^1.2.0 or >=2.0 in dependency declarations without a
lockfile means the build resolves different versions on different days. This applies to
application dependencies, build plugins, CI tool versions, and base container images.
Shared, mutable build environments
Build agents that accumulate state between builds (cached files, installed packages,
leftover containers) produce different results depending on what ran previously. Each
build should start from a clean, known state.
Tests that depend on external services
Tests that call live external APIs, depend on shared databases, or rely on network
resources introduce uncontrolled variance. External services change, experience outages,
and respond with different latency - all of which make the pipeline non-deterministic.
Time-dependent tests
Tests that depend on the current time, current date, or elapsed time are inherently
non-deterministic. A test that passes at 2:00 PM and fails at midnight is not testing
your application - it is testing the clock.
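One common fix is to inject the clock instead of reading it inside the logic, so the test passes identically at 2:00 PM and at midnight. A minimal TypeScript sketch, with illustrative function and test names:

```typescript
// Instead of calling Date.now() inside the logic, accept the current time as an input.
export function isExpired(expiresAt: Date, now: Date): boolean {
  return now.getTime() >= expiresAt.getTime();
}

// The test controls time explicitly, so it is deterministic.
// (Shown with plain assertions; adapt to your test runner.)
function testIsExpired(): void {
  const now = new Date("2024-01-01T00:00:00Z");
  const future = new Date("2024-01-02T00:00:00Z");
  console.assert(isExpired(future, now) === false, "not yet expired");
  console.assert(isExpired(now, now) === true, "expired at the boundary");
}
testIsExpired();
```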
Manual retry culture
Teams that routinely re-run failed pipelines without investigating the failure have
accepted non-determinism as normal. This is a cultural anti-pattern that must be
addressed alongside the technical ones.
Good Patterns
Containerized build environments
Define your build environment as a container image. Pin the base image version. Install
exact versions of all tools. Run every build in a fresh instance of this container. This
eliminates variance from the build environment.
Hermetic builds
A hermetic build is one that does not access the network during the build process. All
dependencies are pre-fetched and cached. The build can run identically on any machine, at
any time, with or without network access.
Contract tests for external dependencies
Replace live calls to external services with contract tests. These tests verify that your
code interacts correctly with an external service’s API contract without actually calling
the service. Combine with service virtualization or test doubles for integration tests.
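A minimal TypeScript sketch of the idea, using a hand-rolled test double; a dedicated contract-testing tool (such as Pact) adds provider-side verification on top of this, and the interface and payloads here are hypothetical.

```typescript
// The code under test depends on an interface, not on a live HTTP client.
interface InventoryApi {
  stockLevel(sku: string): Promise<number>;
}

export async function canFulfil(api: InventoryApi, sku: string, qty: number): Promise<boolean> {
  return (await api.stockLevel(sku)) >= qty;
}

// The test uses a double that encodes the agreed response shape for the contract.
// No network call, no shared environment, no variance between runs.
const fakeInventory: InventoryApi = {
  stockLevel: async (sku: string) => (sku === "SKU-1" ? 5 : 0),
};

canFulfil(fakeInventory, "SKU-1", 3).then((ok) =>
  console.assert(ok === true, "5 in stock covers an order of 3"),
);
```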
Deterministic test ordering
Run tests in a fixed, deterministic order - or better, ensure every test is independent
and can run in any order. Many test frameworks default to random ordering to detect
inter-test dependencies; use this during development but ensure no ordering dependencies
exist.
Immutable CI infrastructure
Treat CI build agents as cattle, not pets. Provision them from images. Replace them
rather than updating them. Never allow state to accumulate on a build agent between
pipeline runs.
How to Get Started
Step 1: Audit your pipeline inputs
List every input to your pipeline that is not version controlled. This includes
dependency versions, tool versions, environment configurations, test data, and pipeline
definitions themselves.
Step 2: Add lockfiles and pin versions
For every dependency manager in your project, ensure a lockfile is committed to version
control. Pin CI tool versions explicitly. Pin base image versions in Dockerfiles.
Step 3: Containerize the build
Move your build steps into containers with explicitly defined environments. This is often
the highest-leverage change for improving determinism.
Step 4: Identify and fix flaky tests
Review your test history for tests that have both passed and failed for the same commit.
Quarantine them immediately and fix or remove them within a defined time window (such as
one sprint).
Step 5: Monitor pipeline determinism
Track the rate of pipeline failures that are resolved by re-running without code changes.
This metric (sometimes called the “re-run rate”) directly measures non-determinism. Drive
it to zero.
Connection to the Pipeline Phase
Determinism is what gives the single path to production
its authority. If the pipeline produces inconsistent results, teams will work around it.
A deterministic pipeline is also the prerequisite for a meaningful
deployable definition - your quality gates are only as
reliable as the pipeline that enforces them.
When the pipeline is deterministic, immutable artifacts become
trustworthy: you know that the artifact was built by a consistent, repeatable process, and
its validation results are real.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
3.3.3 - Deployable Definition
Clear, automated criteria that determine when a change is ready for production.
Phase 2 - Pipeline | Adapted from MinimumCD.org
Definition
A deployable definition is the set of automated quality criteria that every artifact must
satisfy before it is considered ready for production. It is the pipeline’s answer to the
question: “How do we know this is safe to deploy?”
This is not a checklist that a human reviews. It is a set of automated gates - executable
validations built into the pipeline - that every change must pass. If the pipeline is
green, the artifact is deployable. If the pipeline is red, it is not. There is no
ambiguity, no judgment call, and no “looks good enough.”
Why It Matters for CD Migration
Without a clear, automated deployable definition, teams rely on human judgment to decide
when something is ready to ship. This creates bottlenecks (waiting for approval), variance
(different people apply different standards), and fear (nobody is confident the change is
safe). All three are enemies of continuous delivery.
During a CD migration, the deployable definition replaces manual approval processes with
automated confidence. It is what allows a team to say “any green build can go to
production” - which is the prerequisite for continuous deployment.
Key Principles
The definition must be automated
Every criterion in the deployable definition is enforced by an automated check in the
pipeline. If a requirement cannot be automated, either find a way to automate it or
question whether it belongs in the deployment path.
The definition must be comprehensive
The deployable definition should cover all dimensions of quality that matter for
production readiness:
Security
- Static Application Security Testing (SAST) - scan source code for known vulnerability patterns
- Dependency vulnerability scanning - check all dependencies against known vulnerability databases (CVE lists)
- Secret detection - verify that no credentials, API keys, or tokens are present in the codebase
- Container image scanning - if deploying containers, scan images for known vulnerabilities
- License compliance - verify that dependency licenses are compatible with your distribution requirements
Functionality
- Unit tests - fast, isolated tests that verify individual components behave correctly
- Integration tests - tests that verify components work together correctly
- End-to-end tests - tests that verify the system works from the user’s perspective
- Regression tests - tests that verify previously fixed defects have not reappeared
- Contract tests - tests that verify APIs conform to their published contracts
Compliance
- Audit trail - the pipeline itself produces the compliance artifact: who changed what, when, and what validations it passed
- Policy as code - organizational policies (e.g., “no deployments on Friday”) encoded as pipeline logic
- Change documentation - automatically generated from commit metadata and pipeline results
Performance
- Performance benchmarks - verify that key operations complete within acceptable thresholds
- Load test baselines - verify that the system handles expected load without degradation
- Resource utilization checks - verify that the change does not introduce memory leaks or excessive CPU usage
Reliability
- Health check validation - verify that the application starts up correctly and responds to health checks (see the sketch after this list)
- Graceful degradation tests - verify that the system behaves acceptably when dependencies fail
- Rollback verification - verify that the deployment can be rolled back (see Rollback)
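A minimal sketch of the health endpoint referenced above, using Express as an assumed HTTP framework; the route, port, and response fields are illustrative.

```typescript
import express from "express";

const app = express();

// Liveness: the process is up and serving requests.
// A readiness check could additionally verify dependencies (database, cache, queues).
app.get("/healthz", (_req, res) => {
  res.status(200).json({ status: "ok", version: process.env.BUILD_VERSION ?? "unknown" });
});

app.listen(8080, () => console.log("listening on 8080"));
```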
Code Quality
- Linting and static analysis - enforce code style and detect common errors
- Code coverage thresholds - not as a target, but as a safety net to detect large untested areas
- Complexity metrics - flag code that exceeds complexity thresholds for review
The definition must be fast
A deployable definition that takes hours to evaluate will not support continuous delivery.
The entire pipeline - including all deployable definition checks - should complete in
minutes, not hours. This often requires running checks in parallel, investing in test
infrastructure, and making hard choices about which slow checks provide enough value to
keep.
The definition must be maintained
The deployable definition is a living document. As the system evolves, new failure modes
emerge, and the definition should be updated to catch them. When a production incident
occurs, the team should ask: “What automated check could have caught this?” and add it to
the definition.
Anti-Patterns
Manual approval gates
Requiring a human to review and approve a deployment after the pipeline has passed all
automated checks is an anti-pattern. It adds latency, creates bottlenecks, and implies
that the automated checks are not sufficient. If a human must approve, it means your
automated definition is incomplete - fix the definition rather than adding a manual gate.
“Good enough” tolerance
Allowing deployments when some checks fail because “that test always fails” or “it is
only a warning” degrades the deployable definition to meaninglessness. Either the check
matters and must pass, or it does not matter and should be removed.
Post-deployment validation only
Running validation only after deployment to production (production smoke tests, manual
QA in production) means you are using production users to find problems. Pre-deployment
validation must be comprehensive enough that post-deployment checks are a safety net, not
the primary quality gate.
Inconsistent definitions across teams
When different teams have different deployable definitions, organizational confidence
in deployment varies. While the specific checks may differ by service, the categories of
validation (security, functionality, performance, compliance) should be consistent.
Good Patterns
Pipeline gates as policy
Encode the deployable definition as pipeline stages that block progression. A change
cannot move from build to test, or from test to deployment, unless the preceding stage
passes completely. The pipeline enforces the definition; no human override is possible.
Shift-left validation
Run the fastest, most frequently failing checks first. Unit tests and linting run before
integration tests. Integration tests run before end-to-end tests. Security scans run in
parallel with test stages. This gives developers the fastest possible feedback.
Continuous definition improvement
After every production incident, add or improve a check in the deployable definition that
would have caught the issue. Over time, the definition becomes a comprehensive record of
everything the team has learned about quality.
Visible, shared definitions
Make the deployable definition visible to all team members. Display the current pipeline
status on dashboards. When a check fails, provide clear, actionable feedback about what
failed and why. The definition should be understood by everyone, not hidden in pipeline
configuration.
How to Get Started
Step 1: Document your current “definition of done”
Write down every check that currently happens before a deployment - automated or manual.
Include formal checks (tests, scans) and informal ones (someone eyeballs the logs,
someone clicks through the UI).
Step 2: Classify each check
For each check, determine: Is it automated? Is it fast? Is it reliable? Is it actually
catching real problems? This reveals which checks are already pipeline-ready and which
need work.
Step 3: Automate the manual checks
For every manual check, determine how to automate it. A human clicking through the UI
becomes an end-to-end test. A human reviewing logs becomes an automated log analysis step.
A manager approving a deployment becomes a set of automated policy checks.
Step 4: Build the pipeline gates
Organize your automated checks into pipeline stages. Fast checks first, slower checks
later. All checks must pass for the artifact to be considered deployable.
Step 5: Remove manual approvals
Once the automated definition is comprehensive enough that a green build genuinely means
“safe to deploy,” remove manual approval gates. This is often the most culturally
challenging step.
Connection to the Pipeline Phase
The deployable definition is the contract between the pipeline and the organization. It is
what makes the single path to production trustworthy -
because every change that passes through the path has been validated against a clear,
comprehensive standard.
Combined with a deterministic pipeline, the deployable
definition ensures that green means green and red means red. Combined with
immutable artifacts, it ensures that the artifact you validated
is the artifact you deploy. It is the bridge between automated process and organizational
confidence.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
3.3.4 - Immutable Artifacts
Build once, deploy everywhere. The same artifact is used in every environment.
Phase 2 - Pipeline | Adapted from MinimumCD.org
Definition
An immutable artifact is a build output that is created exactly once and deployed to every
environment without modification. The binary, container image, or package that runs in
production is byte-for-byte identical to the one that passed through testing. Nothing is
recompiled, repackaged, or altered between environments.
“Build once, deploy everywhere” is the core principle. The artifact is sealed at build
time. Configuration is injected at deployment time (see
Application Configuration), but the artifact itself never
changes.
Why It Matters for CD Migration
If you build a separate artifact for each environment - or worse, make manual adjustments
to artifacts at deployment time - you can never be certain that what you tested is what
you deployed. Every rebuild introduces the possibility of variance: a different dependency
resolved, a different compiler flag applied, a different snapshot of the source.
Immutable artifacts eliminate an entire class of “works in staging, fails in production”
problems. They provide confidence that the pipeline results are real: the artifact that
passed every quality gate is the exact artifact running in production.
For teams migrating to CD, this practice is a concrete, mechanical step that delivers
immediate trust. Once the team sees that the same container image flows from CI to
staging to production, the deployment process becomes verifiable instead of hopeful.
Key Principles
Build once
The artifact is produced exactly once, during the build stage of the pipeline. It is
stored in an artifact repository (such as a container registry, Maven repository, npm
registry, or object store) and every subsequent stage of the pipeline - and every
environment - pulls and deploys that same artifact.
No manual adjustments
Artifacts are never modified after creation. This means:
- No recompilation for different environments
- No patching binaries in staging to fix a test failure
- No adding environment-specific files into a container image after the build
- No editing properties files inside a deployed artifact
Version everything that goes into the build
Because the artifact is built once and cannot be changed, every input must be correct at
build time:
- Source code - committed to version control at a specific commit hash
- Dependencies - locked to exact versions via lockfiles
- Build tools - pinned to specific versions
- Build configuration - stored in version control alongside the source
Tag and trace
Every artifact must be traceable back to the exact commit, pipeline run, and set of inputs
that produced it. Use content-addressable identifiers (such as container image digests),
semantic version tags, or build metadata that links the artifact to its source.
Anti-Patterns
Rebuilding per environment
Building the artifact separately for development, staging, and production - even from the
same source - means each artifact is a different build. Different builds can produce
different results due to non-deterministic build processes, updated dependencies, or
changed build environments.
SNAPSHOT or mutable versions
Using version identifiers like -SNAPSHOT (Maven), latest (container images), or
unversioned “current” references means the same version label can point to different
artifacts at different times. This makes it impossible to know exactly what is deployed.
Manual intervention at failure points
When a deployment fails, the fix must go through the pipeline. Manually patching the
artifact, restarting with modified configuration, or applying a hotfix directly to the
running system breaks immutability and bypasses the quality gates.
Environment-specific builds
Build scripts that use conditionals like “if production, include X” create
environment-coupled artifacts. The artifact should be environment-agnostic;
environment configuration handles the differences.
Artifacts that self-modify
Applications that write to their own deployment directory, modify their own configuration
files at runtime, or store state alongside the application binary are not truly immutable.
Runtime state must be stored externally.
Good Patterns
Container images as immutable artifacts
Container images are an excellent vehicle for immutable artifacts. A container image built
in CI, pushed to a registry with a content-addressable digest, and pulled into each
environment is inherently immutable. The image that ran in staging is provably identical
to the image running in production.
Instead of rebuilding for each environment, promote the same artifact through environments.
Artifact promotion
The pipeline builds the artifact once, deploys it to a test environment, validates it,
then promotes it (deploys the same artifact) to staging, then production. The artifact
never changes; only the environment it runs in changes.
Content-addressable storage
Use content-addressable identifiers (SHA-256 digests, content hashes) rather than mutable
tags as the primary artifact reference. A content-addressed artifact is immutable by
definition: changing any byte changes the address.
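A minimal sketch of producing such a reference for a build output; the artifact path is hypothetical, and container registries compute the equivalent digest for you when an image is pushed.

```python
import hashlib
from pathlib import Path

def artifact_digest(path: str) -> str:
    """Return a SHA-256 digest that uniquely identifies the artifact's content."""
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
    return f"sha256:{sha.hexdigest()}"

# Record the digest next to the artifact so every environment is deployed -
# and later verified - against the same identifier. The path is hypothetical.
artifact = "build/my-service.tar.gz"
digest = artifact_digest(artifact)
Path(artifact + ".digest").write_text(digest + "\n")
print(digest)
```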
Signed artifacts
Digitally sign artifacts at build time and verify the signature before deployment. This
guarantees that the artifact has not been tampered with between the build and the
deployment. This is especially important for supply chain security.
Reproducible builds
Strive for builds where the same source input produces a bit-for-bit identical artifact.
While not always achievable (timestamps, non-deterministic linkers), getting close makes
it possible to verify that an artifact was produced from its claimed source.
How to Get Started
Step 1: Separate build from deployment
If your pipeline currently rebuilds for each environment, restructure it into two
distinct phases: a build phase that produces a single artifact, and a deployment phase that
takes that artifact and deploys it to a target environment with the appropriate
configuration.
Step 2: Set up an artifact repository
Choose an artifact repository appropriate for your technology stack - a container registry
for container images, a package registry for libraries, or an object store for compiled
binaries. All downstream pipeline stages pull from this repository.
Step 3: Eliminate mutable version references
Replace latest tags, -SNAPSHOT versions, and any other mutable version identifier
with immutable references. Use commit-hash-based tags, semantic versions, or
content-addressable digests.
Step 4: Promote the same artifact through environments
Modify your pipeline to deploy the same artifact to each environment in sequence. The
pipeline should pull the artifact from the repository by its immutable identifier and
deploy it without modification.
Step 5: Add traceability
Ensure every deployed artifact can be traced back to its source commit, build log, and
pipeline run. Label container images with build metadata. Store build provenance alongside
the artifact in the repository.
Step 6: Verify immutability
Periodically verify that what is running in production matches what the pipeline built.
Compare image digests, checksums, or signatures. This catches any manual modifications
that may have bypassed the pipeline.
Connection to the Pipeline Phase
Immutable artifacts are the physical manifestation of trust in the pipeline. The
single path to production ensures all changes flow
through the pipeline. The deterministic pipeline ensures the
build is repeatable. The deployable definition ensures the
artifact meets quality criteria. Immutability ensures that the validated artifact - and
only that artifact - reaches production.
This practice also directly supports rollback: because previous artifacts
are stored unchanged in the artifact repository, rolling back is simply deploying a
previous known-good artifact.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
3.3.5 - Application Configuration
Separate configuration from code so the same artifact works in every environment.
Phase 2 - Pipeline | Adapted from MinimumCD.org
Definition
Application configuration is the practice of correctly separating what varies between
environments from what does not, so that a single immutable artifact
can run in any environment. This distinction - drawn from the
Twelve-Factor App methodology - is essential for
continuous delivery.
There are two distinct types of configuration:
- Application config - settings that define how the application behaves, are the same
  in every environment, and should be bundled with the artifact. Examples: routing rules,
  feature flag defaults, serialization formats, timeout policies, retry strategies.
- Environment config - settings that vary by deployment target and must be injected at
  deployment time. Examples: database connection strings, API endpoint URLs, credentials,
  resource limits, logging levels for that environment.
Getting this distinction right is critical. Bundling environment config into the artifact
breaks immutability. Externalizing application config that does not vary creates
unnecessary complexity and fragility.
Why It Matters for CD Migration
Configuration is where many CD migrations stall. Teams that have been deploying manually
often have configuration tangled with code - hardcoded URLs, environment-specific build
profiles, configuration files that are manually edited during deployment. Untangling this
is a prerequisite for immutable artifacts and automated deployments.
When configuration is handled correctly, the same artifact flows through every environment
without modification, environment-specific values are injected at deployment time, and
feature behavior can be changed without redeploying. This enables the deployment speed and
safety that continuous delivery requires.
Key Principles
Bundle what does not vary
Application configuration that is identical across all environments belongs inside the
artifact. This includes:
- Default feature flag values - the static, compile-time defaults for feature flags
- Application routing and mapping rules - URL patterns, API route definitions
- Serialization and encoding settings - JSON configuration, character encoding
- Internal timeout and retry policies - backoff strategies, circuit breaker thresholds
- Validation rules - input validation constraints, business rule parameters
These values are part of the application’s behavior definition. They should be version
controlled with the source code and deployed as part of the artifact.
Externalize what varies
Environment configuration that changes between deployment targets must be injected at
deployment time:
- Database connection strings - different databases for test, staging, production
- External service URLs - different endpoints for downstream dependencies
- Credentials and secrets - always injected, never bundled, never in version control
- Resource limits - memory, CPU, connection pool sizes tuned per environment
- Environment-specific logging levels - verbose in development, structured in production
- Feature flag overrides - dynamic flag values managed by an external flag service
Feature flags: static vs. dynamic
Feature flags deserve special attention because they span both categories:
- Static feature flags - compiled into the artifact as default values. They define the
  initial state of a feature when the application starts. Changing them requires a new
  build and deployment.
- Dynamic feature flags - read from an external service at runtime. They can be
  toggled without deploying. Use these for operational toggles (kill switches, gradual
  rollouts) and experiment flags (A/B tests).
A well-designed feature flag system uses static defaults (bundled in the artifact) that can
be overridden by a dynamic source (external flag service). If the flag service is
unavailable, the application falls back to its static defaults - a safe, predictable
behavior.
Anti-Patterns
Hardcoded environment-specific values
Database URLs, API endpoints, or credentials embedded directly in source code or
configuration files that are baked into the artifact. This forces a different build per
environment and makes secrets visible in version control.
Externalizing everything
Moving all configuration to an external service - including values that never change
between environments - creates unnecessary runtime dependencies. If the configuration
service is down and a value that is identical in every environment cannot be read, the
application fails to start for no good reason.
Environment-specific build profiles
Build systems that use profiles like mvn package -P production or Webpack configurations
that toggle behavior based on NODE_ENV at build time create environment-coupled
artifacts. The artifact must be the same regardless of where it will run.
Configuration files edited during deployment
Manually editing application.properties, .env files, or YAML configurations on the
server during or after deployment is error-prone, unrepeatable, and invisible to the
pipeline. All configuration injection must be automated.
Secrets in version control
Credentials, API keys, certificates, and tokens must never be stored in version control -
not even in “private” repositories, not even encrypted with simple mechanisms. Use a
secrets manager (Vault, AWS Secrets Manager, Azure Key Vault) and inject secrets at
deployment time.
Good Patterns
Environment variables for environment config
Following the Twelve-Factor App approach, inject environment-specific values as
environment variables. This is universally supported across languages and platforms, works
with containers and orchestrators, and keeps the artifact clean.
Layered configuration
Use a configuration framework that supports layering:
- Defaults - bundled in the artifact (application config)
- Environment overrides - injected via environment variables or mounted config files
- Dynamic overrides - read from a feature flag service or configuration service at runtime
Each layer overrides the previous one. The application always has a working default, and
environment-specific or dynamic values override only what needs to change.
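A minimal sketch of this layering in application code, assuming a hypothetical mounted config file path and environment variable names; most configuration frameworks provide the same behavior out of the box.

```python
import json
import os
from pathlib import Path

# Layer 1: application config bundled with the artifact (identical everywhere).
DEFAULTS = {
    "request_timeout_seconds": 30,
    "retry_attempts": 3,
    "log_level": "INFO",
}

def load_config() -> dict:
    config = dict(DEFAULTS)

    # Layer 2: environment overrides from a mounted config file, if present.
    override_file = Path(os.environ.get("CONFIG_FILE", "/etc/app/config.json"))
    if override_file.exists():
        config.update(json.loads(override_file.read_text()))

    # Layer 3: individual environment variables win over everything else.
    if "LOG_LEVEL" in os.environ:
        config["log_level"] = os.environ["LOG_LEVEL"]
    if "REQUEST_TIMEOUT_SECONDS" in os.environ:
        config["request_timeout_seconds"] = int(os.environ["REQUEST_TIMEOUT_SECONDS"])

    return config

print(load_config())
```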
Config maps and secrets in orchestrators
Kubernetes ConfigMaps and Secrets (or equivalent mechanisms in other orchestrators)
provide a clean separation between the artifact (the container image) and the
environment-specific configuration. The image is immutable; the configuration is injected
at pod startup.
Secrets management with rotation
Use a dedicated secrets manager that supports automatic rotation, audit logging, and
fine-grained access control. The application retrieves secrets at startup or on-demand,
and the secrets manager handles rotation without requiring redeployment.
Configuration validation at startup
The application should validate its configuration at startup and fail fast with a clear
error message if required configuration is missing or invalid. This catches configuration
errors immediately rather than allowing the application to start in a broken state.
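A minimal fail-fast sketch run at application startup; the required variable names are hypothetical.

```python
import os
import sys

REQUIRED = ["DATABASE_URL", "PAYMENTS_API_URL"]  # hypothetical variable names

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    # Fail fast with an actionable message instead of starting in a broken state.
    print(f"FATAL: missing required configuration: {', '.join(missing)}", file=sys.stderr)
    sys.exit(1)
```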
How to Get Started
Step 1: Inventory your configuration
List every configuration value your application uses. For each one, determine: Does this
value change between environments? If yes, it is environment config. If no, it is
application config.
Step 2: Move environment config out of the artifact
For every environment-specific value currently bundled in the artifact (hardcoded URLs,
build profiles, environment-specific property files), extract it and inject it via
environment variable, config map, or secrets manager.
Step 3: Bundle application config with the code
For every value that does not vary between environments, ensure it is committed to version
control alongside the source code and included in the artifact at build time. Remove it
from any external configuration system where it adds unnecessary complexity.
Step 4: Implement feature flags properly
Set up a feature flag framework with static defaults in the code and an external flag
service for dynamic overrides. Ensure the application degrades gracefully if the flag
service is unavailable.
Step 5: Remove environment-specific build profiles
Eliminate any build-time branching based on target environment. The build produces one
artifact. Period.
Step 6: Automate configuration injection
Ensure that configuration injection is fully automated in the deployment pipeline. No
human should manually set environment variables or edit configuration files during
deployment.
Connection to the Pipeline Phase
Application configuration is the enabler that makes
immutable artifacts practical. An artifact can only be truly
immutable if it does not contain environment-specific values that would need to change
between deployments.
Correct configuration separation also supports
production-like environments - because the same
artifact runs everywhere, the only difference between environments is the injected
configuration, which is itself version controlled and automated.
When configuration is externalized correctly, rollback becomes
straightforward: deploy the previous artifact with the appropriate configuration, and the
system returns to its prior state.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
3.3.6 - Production-Like Environments
Test in environments that match production to catch environment-specific issues early.
Phase 2 - Pipeline | Adapted from MinimumCD.org
Definition
Production-like environments are pre-production environments that mirror the
infrastructure, configuration, and behavior of production closely enough that passing
tests in these environments provides genuine confidence that the change will work in
production.
“Production-like” does not mean “identical to production” in every dimension. It means
that the aspects of the environment relevant to the tests being run match production
sufficiently to produce a valid signal. A unit test environment needs the right runtime
version. An integration test environment needs the right service topology. A staging
environment needs the right infrastructure, networking, and data characteristics.
Why It Matters for CD Migration
The gap between pre-production environments and production is where deployment failures
hide. Teams that test in environments that differ significantly from production - in
operating system, database version, network topology, resource constraints, or
configuration - routinely discover issues only after deployment.
For a CD migration, production-like environments are what transform pre-production testing
from “we hope this works” to “we know this works.” They close the gap between the
pipeline’s quality signal and the reality of production, making it safe to deploy
automatically.
Key Principles
Staging reflects production infrastructure
Your staging environment should match production in the dimensions that affect application
behavior:
- Infrastructure platform - same cloud provider, same orchestrator, same service mesh
- Network topology - same load balancer configuration, same DNS resolution patterns,
same firewall rules
- Database engine and version - same database type, same version, same configuration
parameters
- Operating system and runtime - same OS distribution, same runtime version, same
system libraries
- Service dependencies - same versions of downstream services, or accurate test doubles
Staging does not necessarily need the same scale as production (fewer replicas, smaller
instances), but the architecture must be the same.
Environments are version controlled
Every aspect of the environment that can be defined in code must be:
- Infrastructure definitions - Terraform, CloudFormation, Pulumi, or equivalent
- Configuration - Kubernetes manifests, Helm charts, Ansible playbooks
- Network policies - security groups, firewall rules, service mesh configuration
- Monitoring and alerting - the same observability configuration in all environments
Version-controlled environments can be reproduced, compared, and audited. Manual
environment configuration cannot.
Ephemeral environments
Ephemeral environments are full-stack, on-demand, short-lived environments spun up for a
specific purpose - a pull request, a test run, a demo - and destroyed when that purpose is
complete.
Key characteristics of ephemeral environments:
- Full-stack - they include the application and all of its dependencies (databases,
message queues, caches, downstream services), not just the application in isolation
- On-demand - any developer or pipeline can spin one up at any time without waiting
for a shared resource
- Short-lived - they exist for hours or days, not weeks or months. This prevents
configuration drift and stale state
- Version controlled - the environment definition is in code, and the environment is
created from a specific version of that code
- Isolated - they do not share resources with other environments. No shared databases,
no shared queues, no shared service instances
Ephemeral environments eliminate the “shared staging” bottleneck where multiple teams
compete for a single pre-production environment and block each other’s progress.
Data is representative
The data in pre-production environments must be representative of production data in
structure, volume, and characteristics. This does not mean using production data directly
(which raises security and privacy concerns). It means:
- Schema matches production - same tables, same columns, same constraints
- Volume is realistic - tests run against data sets large enough to reveal performance
issues
- Data characteristics are representative - edge cases, special characters,
null values, and data distributions that match what the application will encounter
- Data is anonymized - if production data is used as a seed, all personally
identifiable information is removed or masked
Anti-Patterns
Shared, long-lived staging environments
A single staging environment shared by multiple teams becomes a bottleneck and a source of
conflicts. Teams overwrite each other’s changes, queue up for access, and encounter
failures caused by other teams’ work. Long-lived environments also drift from production
as manual changes accumulate.
Environments that differ from production in critical ways
Running a different database version in staging than production, using a different
operating system, or skipping the load balancer that exists in production creates blind
spots where issues hide until they reach production.
“It works on my laptop” as validation
Developer laptops are the least production-like environment available. They have different
operating systems, different resource constraints, different network characteristics, and
different installed software. Local validation is valuable for fast feedback during
development, but it does not replace testing in a production-like environment.
Manual environment provisioning
Environments created by manually clicking through cloud consoles, running ad-hoc scripts,
or following runbooks are unreproducible and drift over time. If you cannot destroy and
recreate the environment from code in minutes, it is not suitable for continuous delivery.
Synthetic-only test data
Using only hand-crafted test data with a few happy-path records misses the issues that
emerge with production-scale data: slow queries, missing indexes, encoding problems, and
edge cases that only appear in real-world data distributions.
Good Patterns
Infrastructure as Code for all environments
Define every environment - from local development to production - using the same
Infrastructure as Code templates. The differences between environments are captured in
configuration variables (instance sizes, replica counts, domain names), not in different
templates.
Environment-per-pull-request
Automatically provision a full-stack ephemeral environment for every pull request. Run the
full test suite against this environment. Tear it down when the pull request is merged or
closed. This provides isolated, production-like validation for every change.
Production data sampling and anonymization
Build an automated pipeline that samples production data, anonymizes it (removing PII,
masking sensitive fields), and loads it into pre-production environments. This provides
realistic data without security or privacy risks.
Service virtualization for external dependencies
For external dependencies that cannot be replicated in pre-production (third-party APIs,
partner systems), use service virtualization to create realistic test doubles that mimic
the behavior, latency, and error modes of the real service.
Environment parity monitoring
Continuously compare pre-production environments against production to detect drift.
Alert when the infrastructure, configuration, or service versions diverge. Tools that
compare Terraform state, Kubernetes configurations, or cloud resource inventories can
automate this comparison.
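A sketch of the comparison step itself, assuming you can already export a service-to-version inventory from each environment with your own tooling; the example inventories are invented.

```python
def parity_drift(staging: dict[str, str], production: dict[str, str]) -> list[str]:
    """List services whose versions differ between the two environments."""
    all_services = sorted(set(staging) | set(production))
    return [
        f"{svc}: staging={staging.get(svc)} production={production.get(svc)}"
        for svc in all_services
        if staging.get(svc) != production.get(svc)
    ]

# Example inventories (normally exported from your deployment tooling).
print(parity_drift(
    {"orders": "1.4.2", "payments": "2.0.1"},
    {"orders": "1.4.2", "payments": "2.1.0"},
))
```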
Namespaced environments in shared clusters
In Kubernetes or similar platforms, use namespaces to create isolated environments within
a shared cluster. Each namespace gets its own set of services, databases, and
configuration, providing isolation without the cost of separate clusters.
How to Get Started
Step 1: Audit environment parity
Compare your current pre-production environments against production across every relevant
dimension: infrastructure, configuration, data, service versions, network topology. List
every difference.
Step 2: Infrastructure-as-Code your environments
If your environments are not yet defined in code, start here. Define your production
environment in Terraform, CloudFormation, or equivalent. Then create pre-production
environments from the same definitions with different parameter values.
Step 3: Address the highest-risk parity gaps
From your audit, identify the differences most likely to cause production failures -
typically database version mismatches, missing infrastructure components, or network
configuration differences. Fix these first.
Step 4: Implement ephemeral environments
Build the tooling to spin up and tear down full-stack environments on demand. Start with
a simplified version (perhaps without full data replication) and iterate toward full
production parity.
Step 5: Automate data provisioning
Create an automated pipeline for generating or sampling representative test data. Include
anonymization, schema validation, and data refresh on a regular schedule.
Step 6: Monitor and maintain parity
Set up automated checks that compare pre-production environments to production and alert
on drift. Make parity a continuous concern, not a one-time setup.
Connection to the Pipeline Phase
Production-like environments are where the pipeline’s quality gates run. Without
production-like environments, the deployable definition
produces a false signal - tests pass in an environment that does not resemble production,
and failures appear only after deployment.
Immutable artifacts flow through these environments unchanged,
with only configuration varying. This combination - same
artifact, production-like environment, environment-specific configuration - is what gives
the pipeline its predictive power.
Production-like environments also support effective rollback testing: you
can validate that a rollback works correctly in a staging environment before relying on it
in production.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
3.3.7 - Pipeline Architecture
Design efficient quality gates for your delivery system’s context.
Phase 2 - Pipeline | Adapted from Dojo Consortium
Definition
Pipeline architecture is the structural design of your delivery pipeline - how stages are
organized, how quality gates are sequenced, how feedback loops operate, and how the
pipeline evolves over time. It encompasses both the technical design of the pipeline and
the improvement journey that a team follows from an initial, fragile pipeline to a mature,
resilient delivery system.
Good pipeline architecture is not achieved in a single step. Teams progress through
recognizable states, applying the Theory of Constraints to systematically identify and
resolve bottlenecks. The goal is a loosely coupled architecture where independent services
can be built, tested, and deployed independently through their own pipelines.
Why It Matters for CD Migration
Most teams beginning a CD migration have a pipeline that is somewhere between “barely
functional” and “works most of the time.” The pipeline may be slow, fragile, or tightly
coupled to other systems. Improving it requires a deliberate architectural approach - not
just adding more stages or more tests, but designing the pipeline for the flow
characteristics that continuous delivery demands.
Understanding where your pipeline architecture currently stands, and what the next
improvement looks like, prevents teams from either stalling at a “good enough” state or
attempting to jump directly to a target state that their context cannot support.
Three Architecture States
Teams typically progress through three recognizable states on their journey to mature
pipeline architecture. Understanding which state you are in determines what improvements
to prioritize.
Entangled (Starting State)
In the entangled state, the pipeline has significant structural problems that prevent
reliable delivery:
- Multiple applications share a single pipeline - a change to one application triggers
builds and tests for all applications, causing unnecessary delays and false failures
- Shared, mutable infrastructure - pipeline stages depend on shared databases, shared
environments, or shared services that introduce coupling and contention
- Manual stages interrupt automated flow - manual approval gates, manual test
execution, or manual environment provisioning block the pipeline for hours or days
- No clear ownership - the pipeline is maintained by a central team, and application
teams cannot modify it without filing tickets and waiting
- Build times measured in hours - the pipeline is so slow that developers batch
changes and avoid running it
- Flaky tests are accepted - the team routinely re-runs failed pipelines, and failures
are assumed to be transient
Remediation priorities:
- Separate pipelines for separate applications
- Remove manual stages or parallelize them out of the critical path
- Fix or remove flaky tests
- Establish clear pipeline ownership with the application team
Tightly Coupled (Transitional)
In the tightly coupled state, each application has its own pipeline, but pipelines depend
on each other or on shared resources:
- Integration tests span multiple services - a pipeline for service A runs integration
tests that require service B, C, and D to be deployed in a specific state
- Shared test environments - multiple pipelines deploy to the same staging environment,
creating contention and sequencing constraints
- Coordinated deployments - deploying service A requires simultaneously deploying
service B, which requires coordinating two pipelines
- Shared build infrastructure - pipelines compete for limited build agent capacity,
causing queuing delays
- Pipeline definitions are centralized - a shared pipeline library controls the
structure, and application teams cannot customize it for their needs
Improvement priorities:
- Replace cross-service integration tests with contract tests
- Implement ephemeral environments to eliminate shared environment contention
- Decouple service deployments using backward-compatible changes and feature flags
- Give teams ownership of their pipeline definitions
- Scale build infrastructure to eliminate queuing
Loosely Coupled (Goal)
In the loosely coupled state, each service has an independent pipeline that can build,
test, and deploy without depending on other services’ pipelines:
- Independent deployability - any service can be deployed at any time without
coordinating with other teams
- Contract-based integration - services verify their interactions through contract
tests, not cross-service integration tests
- Ephemeral, isolated environments - each pipeline creates its own test environment
and tears it down when done
- Team-owned pipelines - each team controls their pipeline definition and can optimize
it for their service’s needs
- Fast feedback - the pipeline completes in minutes, providing rapid feedback to
developers
- Self-service infrastructure - teams provision their own pipeline infrastructure
without waiting for a central team
Applying the Theory of Constraints
Pipeline improvement follows the Theory of Constraints: identify the single biggest
bottleneck, resolve it, and repeat. The key steps:
Step 1: Identify the constraint
Measure where time is spent in the pipeline. Common constraints include:
- Slow test suites - tests that take 30+ minutes dominate the pipeline duration
- Queuing for shared resources - pipelines waiting for build agents, shared
environments, or manual approvals
- Flaky failures and re-runs - time lost to investigating and re-running non-deterministic
failures
- Large batch sizes - pipelines triggered by large, infrequent commits that take
longer to build and are harder to debug when they fail
Step 2: Exploit the constraint
Get the maximum throughput from the current constraint without changing the architecture:
- Parallelize test execution across multiple agents
- Cache dependencies to speed up the build stage
- Prioritize pipeline runs (trunk commits before branch builds)
- Deduplicate unnecessary work (skip unchanged modules)
Step 3: Subordinate everything else to the constraint
Ensure that other parts of the system do not overwhelm the constraint:
- If the test stage is the bottleneck, do not add more tests without first making
existing tests faster
- If the build stage is the bottleneck, do not add more build steps without first
optimizing the build
Step 4: Elevate the constraint
If exploiting the constraint is not sufficient, invest in removing it:
- Rewrite slow tests to be faster
- Replace shared environments with ephemeral environments
- Replace manual gates with automated checks
- Split monolithic pipelines into independent service pipelines
Step 5: Repeat
Once a constraint is resolved, a new constraint will emerge. This is expected. The
pipeline improves through continuous iteration, not through a single redesign.
Key Design Principles
Fast feedback first
Organize pipeline stages so that the fastest checks run first. A developer should know
within minutes if their change has an obvious problem (compilation failure, linting error,
unit test failure). Slower checks (integration tests, security scans, performance tests)
run after the fast checks pass.
Fail fast, fail clearly
When the pipeline fails, it should fail as early as possible and produce a clear, actionable
error message. A developer should be able to read the failure output and know exactly what
to fix without digging through logs.
Parallelize where possible
Stages that do not depend on each other should run in parallel. Security scans can run
alongside integration tests. Linting can run alongside compilation. Parallelization is the
most effective way to reduce pipeline duration without removing checks.
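As a sketch, independent checks can be fanned out from a small driver script; the commands shown are placeholders, and most CI systems express the same fan-out natively in their pipeline definitions.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Independent checks run side by side; the slowest one sets the stage duration.
# The commands are placeholders - substitute your own tooling.
PARALLEL_CHECKS = {
    "lint": ["npm", "run", "lint"],
    "security scan": ["npm", "audit"],
    "integration tests": ["npm", "run", "test:integration"],
}

with ThreadPoolExecutor(max_workers=len(PARALLEL_CHECKS)) as pool:
    futures = {name: pool.submit(subprocess.run, cmd) for name, cmd in PARALLEL_CHECKS.items()}
    results = {name: future.result().returncode == 0 for name, future in futures.items()}

for name, passed in results.items():
    print(f"{name}: {'passed' if passed else 'failed'}")
```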
Pipeline as code
The pipeline definition lives in the same repository as the application it builds and
deploys. This gives the team full ownership and allows the pipeline to evolve alongside
the application.
Observability
Instrument the pipeline itself with metrics and monitoring. Track:
- Lead time - time from commit to production deployment
- Pipeline duration - time from pipeline start to completion
- Failure rate - percentage of pipeline runs that fail
- Recovery time - time from failure detection to successful re-run
- Queue time - time spent waiting before the pipeline starts
These metrics identify bottlenecks and measure improvement over time.
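A sketch of computing a few of these from pipeline run records; the PipelineRun shape is a stand-in for whatever your CI system exports, and lead time additionally needs commit and deployment timestamps, which are omitted here.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class PipelineRun:
    queued_at: datetime
    started_at: datetime
    finished_at: datetime
    succeeded: bool

def summarize(runs: list[PipelineRun]) -> dict[str, float]:
    """Aggregate queue time, duration, and failure rate across pipeline runs."""
    return {
        "queue_time_min": mean((r.started_at - r.queued_at).total_seconds() / 60 for r in runs),
        "pipeline_duration_min": mean((r.finished_at - r.started_at).total_seconds() / 60 for r in runs),
        "failure_rate_pct": 100 * sum(not r.succeeded for r in runs) / len(runs),
    }
```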
Anti-Patterns
The “grand redesign”
Attempting to redesign the entire pipeline at once, rather than iteratively improving the
biggest constraint, is a common failure mode. Grand redesigns take too long, introduce too
much risk, and often fail to address the actual problems.
Central pipeline teams that own all pipelines
A central team that controls all pipeline definitions creates a bottleneck. Application
teams wait for changes, cannot customize pipelines for their context, and are disconnected
from their own delivery process.
Optimizing non-constraints
Speeding up a pipeline stage that is not the bottleneck does not improve overall delivery
time. Measure before optimizing.
Monolithic pipeline for microservices
Running all microservices through a single pipeline that builds and deploys everything
together defeats the purpose of a microservice architecture. Each service should have its
own independent pipeline.
How to Get Started
Step 1: Assess your current state
Determine which architecture state - entangled, tightly coupled, or loosely coupled -
best describes your current pipeline. Be honest about where you are.
Step 2: Measure your pipeline
Instrument your pipeline to measure duration, failure rates, queue times, and
bottlenecks. You cannot improve what you do not measure.
Step 3: Identify the top constraint
Using your measurements, identify the single biggest bottleneck in your pipeline. This is
where you focus first.
Step 4: Apply the Theory of Constraints cycle
Exploit, subordinate, and if necessary elevate the constraint. Then measure again and
identify the next constraint.
Step 5: Evolve toward loose coupling
With each improvement cycle, move toward independent, team-owned pipelines that can
build, test, and deploy services independently. This is a journey of months or years,
not days.
Connection to the Pipeline Phase
Pipeline architecture is where all the other practices in this phase come together. The
single path to production defines the route. The
deterministic pipeline ensures reliability. The
deployable definition defines the quality gates. The
architecture determines how these elements are organized, sequenced, and optimized for
flow.
As teams mature their pipeline architecture toward loose coupling, they build the
foundation for Phase 3: Optimize - where the focus shifts from building
the pipeline to improving its speed and reliability.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
3.3.8 - Rollback
Enable fast recovery from any deployment by maintaining the ability to roll back.
Phase 2 - Pipeline | Adapted from MinimumCD.org
Definition
Rollback is the ability to quickly and safely revert a production deployment to a previous
known-good state. It is the safety net that makes continuous delivery possible: because you
can always undo a deployment, deploying becomes a low-risk, routine operation.
Rollback is not a backup plan for when things go catastrophically wrong. It is a standard
operational capability that should be exercised regularly and trusted completely. Every
deployment to production should be accompanied by a tested, automated, fast rollback
mechanism.
Why It Matters for CD Migration
Fear of deployment is the single biggest cultural barrier to continuous delivery. Teams
that have experienced painful, irreversible deployments develop a natural aversion to
deploying frequently. They batch changes, delay releases, and add manual approval gates -
all of which slow delivery and increase risk.
Reliable, fast rollback breaks this cycle. When the team knows that any deployment can be
reversed in minutes, the perceived risk of deployment drops dramatically. Smaller, more
frequent deployments become possible. The feedback loop tightens. The entire delivery
system improves.
Key Principles
Fast
Rollback must complete in minutes, not hours. A rollback that takes an hour to execute
is not a rollback - it is a prolonged outage with a recovery plan. Target rollback times
of 5 minutes or less for the deployment mechanism itself. If the previous artifact is
already in the artifact repository and the deployment mechanism is automated, there is
no reason rollback should take longer than a fresh deployment.
Automated
Rollback must be a single command or a single click - or better, fully automated based
on health checks. It should not require:
- SSH access to production servers
- Manual editing of configuration files
- Running scripts with environment-specific parameters from memory
- Coordinating multiple teams to roll back multiple services simultaneously
Safe
Rollback must not make things worse. This means:
- Rolling back must not lose data
- Rolling back must not corrupt state
- Rolling back must not break other services that depend on the rolled-back service
- Rolling back must not require downtime beyond what the deployment mechanism itself imposes
Simple
The rollback procedure should be understandable by any team member, including those who
did not perform the original deployment. It should not require specialized knowledge, deep
system understanding, or heroic troubleshooting.
Tested
Rollback must be tested regularly, not just documented. A rollback procedure that has
never been exercised is a rollback procedure that will fail when you need it most. Include
rollback verification in your deployable definition and
practice rollback as part of routine deployment validation.
Rollback Strategies
Blue-Green Deployment
Maintain two identical production environments - blue and green. At any time, one is live
(serving traffic) and the other is idle. To deploy, deploy to the idle environment, verify
it, and switch traffic. To roll back, switch traffic back to the previous environment.
Advantages:
- Rollback is instantaneous - just a traffic switch
- The previous version remains running and warm
- Zero-downtime deployment and rollback
Considerations:
- Requires double the infrastructure (though the idle environment can be scaled down)
- Database changes must be backward-compatible across both versions
- Session state must be externalized so it survives the switch
Canary Deployment
Deploy the new version to a small subset of production infrastructure (the “canary”) and
route a percentage of traffic to it. Monitor the canary for errors, latency, and business
metrics. If the canary is healthy, gradually increase traffic. If problems appear, route
all traffic back to the previous version.
Advantages:
- Limits blast radius - problems affect only a fraction of users
- Provides real production data for validation before full rollout
- Rollback is fast - stop sending traffic to the canary
Considerations:
- Requires traffic routing infrastructure (service mesh, load balancer configuration)
- Both versions must be able to run simultaneously
- Monitoring must be sophisticated enough to detect subtle problems in the canary
Feature Flag Rollback
When a deployment introduces new behavior behind a feature flag, rollback can be as
simple as turning off the flag. The code remains deployed, but the new behavior is
disabled. This is the fastest possible rollback - it requires no deployment at all.
Advantages:
- Instantaneous - no deployment, no traffic switch
- Granular - roll back a single feature without affecting other changes
- No infrastructure changes required
Considerations:
- Requires a feature flag system with runtime toggle capability
- Only works for changes that are behind flags
- Feature flag debt (old flags that are never cleaned up) must be managed
Database-Safe Rollback with Expand-Contract
Database schema changes are the most common obstacle to rollback. If a deployment changes
the database schema, rolling back the application code may fail if the old code is
incompatible with the new schema.
The expand-contract pattern (also called parallel change) solves this:
- Expand - add new columns, tables, or structures alongside the existing ones. The
old application code continues to work. Deploy this change.
- Migrate - update the application to write to both old and new structures, and read
from the new structure. Deploy this change. Backfill historical data.
- Contract - once all application versions using the old structure are retired, remove
the old columns or tables. Deploy this change.
At every step, the previous application version remains compatible with the current
database schema. Rollback is always safe.
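As an illustration, here is what the three separately deployed steps might look like for a hypothetical rename of users.username to users.login; the table and column names are invented for the sketch.

```python
# Each constant is a separately deployed step for a hypothetical rename of
# users.username to users.login. The schema details are invented for the sketch.

EXPAND = "ALTER TABLE users ADD COLUMN login TEXT;"
# Old application code keeps reading and writing username; nothing breaks.

MIGRATE = "UPDATE users SET login = username WHERE login IS NULL;"
# Deployed alongside application code that writes both columns and reads login.

CONTRACT = "ALTER TABLE users DROP COLUMN username;"
# Run only after no deployed application version still reads username.
```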
Anti-pattern: Destructive schema changes (dropping columns, renaming tables,
changing types) deployed simultaneously with the application code change that requires
them. This makes rollback impossible because the old code cannot work with the new schema.
Anti-Patterns
“We’ll fix forward”
Relying exclusively on fixing forward (deploying a new fix rather than rolling back) is
dangerous when the system is actively degraded. Fix-forward should be an option when
the issue is well-understood and the fix is quick. Rollback should be the default when
the issue is unclear or the fix will take time. Both capabilities must exist.
Rollback as a documented procedure only
A rollback procedure that exists only in a runbook, wiki, or someone’s memory is not a
reliable rollback capability. Procedures that are not automated and regularly tested will
fail under the pressure of a production incident.
Coupled service rollbacks
When rolling back service A requires simultaneously rolling back services B and C, you
do not have independent rollback capability. Design services to be backward-compatible
so that each service can be rolled back independently.
Destructive database migrations
Schema changes that destroy data or break backward compatibility make rollback impossible.
Always use the expand-contract pattern for schema changes.
Manual rollback requiring specialized knowledge
If only one person on the team knows how to perform a rollback, the team does not have a
rollback capability - it has a single point of failure. Rollback must be simple enough
for any team member to execute.
Good Patterns
Automated rollback on health check failure
Configure the deployment system to automatically roll back if the new version fails
health checks within a defined window after deployment. This removes the need for a human
to detect the problem and initiate the rollback.
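A minimal sketch of such a watcher, assuming a hypothetical health endpoint and a hypothetical single-command rollback script; deployment platforms and progressive delivery tools offer this as a built-in capability.

```python
import subprocess
import time
import urllib.request

HEALTH_URL = "https://my-service.internal/healthz"        # hypothetical endpoint
ROLLBACK_COMMAND = ["./rollback.sh", "--to", "previous"]   # hypothetical one-command rollback
VERIFICATION_WINDOW_SECONDS = 300

def healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as response:
            return response.status == 200
    except OSError:
        return False

deadline = time.monotonic() + VERIFICATION_WINDOW_SECONDS
while time.monotonic() < deadline:
    if not healthy():
        print("Health check failed - rolling back to the previous artifact.")
        subprocess.run(ROLLBACK_COMMAND, check=True)
        break
    time.sleep(15)
else:
    print("New version stayed healthy through the verification window.")
```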
Rollback testing in staging
As part of every deployment to staging, deploy the new version, verify it, then roll it
back and verify the rollback. This ensures that rollback works for every release, not
just in theory.
Artifact retention
Retain previous artifact versions in the artifact repository so that rollback is always
possible. Define a retention policy (for example, keep the last 10 production-deployed
versions) and ensure that rollback targets are always available.
Deployment log and audit trail
Maintain a clear record of what is currently deployed, what was previously deployed, and
when changes occurred. This makes it easy to identify the correct rollback target and
verify that the rollback was successful.
Rollback runbook exercises
Regularly practice rollback as a team exercise - not just as part of automated testing,
but as a deliberate drill. This builds team confidence and identifies gaps in the process.
How to Get Started
Step 1: Document your current rollback capability
Can you roll back your current production deployment right now? How long would it take?
Who would need to be involved? What could go wrong? Be honest about the answers.
Step 2: Implement a basic automated rollback
Start with the simplest mechanism available for your deployment platform - redeploying the
previous container image, switching a load balancer target, or reverting a Kubernetes
deployment. Automate this as a single command.
Step 3: Test the rollback
Deploy a change to staging, then roll it back. Verify that the system returns to its
previous state. Make this a standard part of your deployment validation.
Step 4: Address database compatibility
Audit your database migration practices. If you are making destructive schema changes,
shift to the expand-contract pattern. Ensure that the previous application version is
always compatible with the current database schema.
Step 5: Reduce rollback time
Measure how long rollback takes. Identify and eliminate delays - slow artifact downloads,
slow startup times, manual steps. Target rollback completion in under 5 minutes.
Step 6: Build team confidence
Practice rollback regularly. Demonstrate it during deployment reviews. Make it a normal
part of operations, not an emergency procedure. When the team trusts rollback, they will
trust deployment.
Connection to the Pipeline Phase
Rollback is the capstone of the Pipeline phase. It is what makes the rest of the phase
safe.
With rollback in place, the team has the confidence to deploy frequently, which is the
foundation for Phase 3: Optimize and ultimately
Phase 4: Deliver on Demand.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
3.4 - Phase 3: Optimize
Improve flow by reducing batch size, limiting work in progress, and using metrics to drive improvement.
Key question: “Can we deliver small changes quickly?”
With a working pipeline in place, this phase focuses on optimizing the flow of changes
through it. Smaller batches, feature flags, and WIP limits reduce risk and increase
delivery frequency.
What You’ll Do
- Reduce batch size - Deliver smaller, more frequent changes
- Use feature flags - Decouple deployment from release
- Limit work in progress - Focus on finishing over starting
- Drive improvement with metrics - Use DORA metrics and improvement kata
- Run effective retrospectives - Continuously improve the delivery process
- Decouple architecture - Enable independent deployment of components
Why This Phase Matters
Having a pipeline isn’t enough - you need to optimize the flow through it. Teams that
deploy weekly with a CD pipeline are missing most of the benefits. Small batches reduce
risk, feature flags enable testing in production, and metrics-driven improvement creates
a virtuous cycle of getting better at getting better.
When You’re Ready to Move On
You’re ready for Phase 4: Deliver on Demand when:
- Most changes are small enough to deploy independently
- Feature flags let you deploy incomplete features safely
- Your WIP limits keep work flowing without bottlenecks
- You’re measuring and improving your DORA metrics regularly
3.4.1 - Small Batches
Deliver smaller, more frequent changes to reduce risk and increase feedback speed.
Phase 3 - Optimize | Adapted from MinimumCD.org
Batch size is the single biggest lever for improving delivery performance. This page covers what batch size means at every level - deploy frequency, commit size, and story size - and provides concrete techniques for reducing it.
Why Batch Size Matters
Large batches create large risks. When you deploy 50 changes at once, any failure could be caused by any of those 50 changes. When you deploy 1 change, the cause of any failure is obvious.
This is not a theory. The DORA research consistently shows that elite teams deploy more frequently, with smaller changes, and have both higher throughput and lower failure rates. Small batches are the mechanism that makes this possible.
“If it hurts, do it more often, and bring the pain forward.”
- Jez Humble, Continuous Delivery
Three Levels of Batch Size
Batch size is not just about deployments. It operates at three distinct levels, and optimizing only one while ignoring the others limits your improvement.
Level 1: Deploy Frequency
How often you push changes to production.
| State | Deploy Frequency | Risk Profile |
|---|---|---|
| Starting | Monthly or quarterly | Each deploy is a high-stakes event |
| Improving | Weekly | Deploys are planned but routine |
| Optimizing | Daily | Deploys are unremarkable |
| Elite | Multiple times per day | Deploys are invisible |
How to reduce: Remove manual gates, automate approval workflows, build confidence through progressive rollout. If your pipeline is reliable (Phase 2), the only thing preventing more frequent deploys is organizational habit.
Level 2: Commit Size
How much code changes in each commit to trunk.
| Indicator | Too Large | Right-Sized |
|---|---|---|
| Files changed | 20+ files | 1-5 files |
| Lines changed | 500+ lines | Under 100 lines |
| Review time | Hours or days | Minutes |
| Merge conflicts | Frequent | Rare |
| Description length | Paragraph needed | One sentence suffices |
How to reduce: Practice TDD (write one test, make it pass, commit). Use feature flags to merge incomplete work. Pair program so review happens in real time.
Level 3: Story Size
How much scope each user story or work item contains.
A story that takes a week to complete is a large batch. It means a week of work piles up before integration, a week of assumptions go untested, and a week of inventory sits in progress.
Target: Every story should be completable - coded, tested, reviewed, and integrated - in two days or less. If it cannot be, it needs to be decomposed further.
Behavior-Driven Development for Decomposition
BDD provides a concrete technique for breaking stories into small, testable increments. The Given-When-Then format forces clarity about scope.
The Given-When-Then Pattern
Each scenario becomes a deliverable increment. You can implement and deploy the first scenario before starting the second. This is how you turn a “discount feature” (large batch) into three independent, deployable changes (small batches).
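For example, the hypothetical "discount feature" might decompose into scenarios like the following pytest-style sketch; Cart and the discount rule are stand-ins, not a real API.

```python
# Pytest-style sketch: one scenario per test, each small enough to ship on its own.
# Cart and the discount rule are stand-ins, not a real API.

class Cart:
    def __init__(self, total: float) -> None:
        self.total = total

    def apply_discount(self, code: str) -> None:
        if code == "SAVE10":          # placeholder rule for the sketch
            self.total *= 0.9

def test_simple_percentage_discount_is_applied():
    # Given a cart totaling 100 and a valid 10% discount code
    cart = Cart(total=100)
    # When the shopper applies the code
    cart.apply_discount("SAVE10")
    # Then the total is reduced to 90
    assert cart.total == 90

# "Reject expired discount codes" and "apply discounts only to eligible items"
# become their own tests - and their own independently deployable slices.
```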
Decomposing Stories Using Scenarios
When a story has too many scenarios, it is too large. Use this process:
- Write all the scenarios first. Before any code, enumerate every Given-When-Then for the story.
- Group scenarios into deliverable slices. Each slice should be independently valuable or at least independently deployable.
- Create one story per slice. Each story has 1-3 scenarios and can be completed in 1-2 days.
- Order the slices by value. Deliver the most important behavior first.
Example decomposition:
| Original Story | Scenarios | Sliced Into |
|---|---|---|
| "As a user, I can manage my profile" | 12 scenarios covering name, email, password, avatar, notifications, privacy, deactivation | 5 stories: basic info (2 scenarios), password (2), avatar (2), notifications (3), deactivation (3) |
Vertical Slicing
A vertical slice cuts through all layers of the system to deliver a thin piece of end-to-end functionality. This is the opposite of horizontal slicing, where you build all the database changes, then all the API changes, then all the UI changes.
Horizontal vs. Vertical Slicing
Horizontal (avoid):
- Story 1: Build the database schema for discounts
- Story 2: Build the API endpoints for discounts
- Story 3: Build the UI for applying discounts
Problems: Stories 1 and 2 deliver no user value. You cannot test end-to-end until story 3 is done. Integration risk accumulates.
Vertical (prefer):
- Story 1: Apply a simple percentage discount (DB + API + UI for one scenario)
- Story 2: Reject expired discount codes (DB + API + UI for one scenario)
- Story 3: Apply discounts only to eligible items (DB + API + UI for one scenario)
Benefits: Every story delivers testable, deployable functionality. Integration happens with each story, not at the end. You can ship story 1 and get feedback before building story 2.
How to Slice Vertically
Ask these questions about each proposed story:
- Can a user (or another system) observe the change? If not, slice differently.
- Can I write an end-to-end test for it? If not, the slice is incomplete.
- Does it require all other slices to be useful? If yes, find a thinner first slice.
- Can it be deployed independently? If not, check whether feature flags could help.
Practical Steps for Reducing Batch Size
Week 1-2: Measure Current State
Before changing anything, measure where you are:
- Average commit size (lines changed per commit)
- Average story cycle time (time from start to done)
- Deploy frequency (how often changes reach production)
- Average changes per deploy (how many commits per deployment)
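For the first of these measures, a rough baseline can be pulled straight from git history. This sketch ignores merge-commit diffs and binary files and is only meant as a starting point.

```python
import subprocess

def average_commit_size(last_n: int = 200) -> float:
    """Average lines changed (added + deleted) per commit on the current branch."""
    log = subprocess.run(
        ["git", "log", f"-{last_n}", "--numstat", "--format=COMMIT"],
        capture_output=True, text=True, check=True,
    ).stdout
    commits, lines = 0, 0
    for row in log.splitlines():
        if row == "COMMIT":
            commits += 1
        elif row.strip():
            added, deleted, _path = row.split("\t", 2)
            if added.isdigit() and deleted.isdigit():   # numstat shows '-' for binary files
                lines += int(added) + int(deleted)
    return lines / commits if commits else 0.0

print(f"Average commit size: {average_commit_size():.0f} lines changed")
```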
Week 3-4: Introduce Story Decomposition
- Start writing BDD scenarios before implementation
- Split any story estimated at more than 2 days
- Track the number of stories completed per week (expect this to increase as stories get smaller)
Week 5-8: Tighten Commit Size
- Adopt the discipline of “one logical change per commit”
- Use TDD to create a natural commit rhythm: write test, make it pass, commit
- Track average commit size and set a team target (e.g., under 100 lines)
Ongoing: Increase Deploy Frequency
- Deploy at least once per day, then work toward multiple times per day
- Remove any batch-oriented processes (e.g., “we deploy on Tuesdays”)
- Make deployment a non-event
Key Pitfalls
1. “Small stories take more overhead to manage”
This is true only if your process adds overhead per story (e.g., heavyweight estimation ceremonies, multi-level approval). The solution is to simplify the process, not to keep stories large. Overhead per story should be near zero for a well-decomposed story.
2. “Some things can’t be done in small batches”
Almost anything can be decomposed further. Database migrations can be done in backward-compatible steps. API changes can use versioning. UI changes can be hidden behind feature flags. The skill is in finding the decomposition, not in deciding whether one exists.
3. “We tried small stories but our throughput dropped”
This usually means the team is still working sequentially. Small stories require limiting WIP and swarming - see Limiting WIP. If the team starts 10 small stories instead of 2 large ones, they have not actually reduced batch size; they have increased WIP.
Measuring Success
You are succeeding when the measures from your initial baseline move in the right direction: average commit size and changes per deploy trend down, while deploy frequency and stories completed per week trend up.
Next Step
Small batches often require deploying incomplete features to production. Feature Flags provide the mechanism to do this safely.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
3.4.2 - Feature Flags
Decouple deployment from release by using feature flags to control feature visibility.
Phase 3 - Optimize | Adapted from MinimumCD.org
Feature flags are the mechanism that makes trunk-based development and small batches safe. They let you deploy code to production without exposing it to users, enabling dark launches, gradual rollouts, and instant rollback of features without redeploying.
Why Feature Flags?
In continuous delivery, deployment and release are two separate events:
- Deployment is pushing code to production.
- Release is making a feature available to users.
Feature flags are the bridge between these two events. They let you deploy frequently (even multiple times a day) without worrying about exposing incomplete or untested features. This separation is what makes continuous deployment possible for teams that ship real products to real users.
When You Need Feature Flags (and When You Don’t)
Not every change requires a feature flag. Flags add complexity, and unnecessary complexity slows you down. Use this decision tree to determine the right approach.
Decision Tree
Is the change user-visible?
├── No → Deploy without a flag
│ (refactoring, performance improvements, dependency updates)
│
└── Yes → Can it be completed and deployed in a single small batch?
├── Yes → Deploy without a flag
│ (bug fixes, copy changes, small UI tweaks)
│
└── No → Is there a seam in the code where you can introduce the change?
├── Yes → Consider Branch by Abstraction
│ (replacing a subsystem, swapping an implementation)
│
└── No → Is it a new feature with a clear entry point?
├── Yes → Use a Feature Flag
│
└── No → Consider Connect Tests Last
(build the internals first, wire them up last)
Alternatives to Feature Flags
| Technique | How It Works | When to Use |
| --- | --- | --- |
| Branch by Abstraction | Introduce an abstraction layer, build the new implementation behind it, switch when ready | Replacing an existing subsystem or library |
| Connect Tests Last | Build internal components without connecting them to the UI or API | New backend functionality that has no user-facing impact until connected |
| Dark Launch | Deploy the code path but do not route any traffic to it | New infrastructure, new services, or new endpoints that are not yet referenced |
These alternatives avoid the lifecycle overhead of feature flags while still enabling trunk-based development with incomplete work.
Implementation Approaches
Feature flags can be implemented at different levels of sophistication. Start simple and add complexity only when needed.
Level 1: Static Code-Based Flags
The simplest approach: a boolean constant or configuration value checked in code.
Pros: Zero infrastructure. Easy to understand. Works everywhere.
Cons: Changing a flag requires a deployment. No per-user targeting. No gradual rollout.
Best for: Teams starting out. Internal tools. Changes that will be fully on or fully off.
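A minimal sketch of a Level 1 flag; the flag name and checkout functions are illustrative, not names from the original text:

```python
# Level 1: a static, code-based flag. Changing it requires a redeploy.
NEW_CHECKOUT_ENABLED = False  # flip to True in a later commit when the feature is ready


def render_legacy_checkout(cart: list[str]) -> str:
    return f"legacy checkout: {len(cart)} items"


def render_new_checkout(cart: list[str]) -> str:
    return f"new checkout: {len(cart)} items"


def render_checkout(cart: list[str]) -> str:
    # The flag check keeps the new path dark until the constant is flipped.
    if NEW_CHECKOUT_ENABLED:
        return render_new_checkout(cart)
    return render_legacy_checkout(cart)


print(render_checkout(["book", "pen"]))  # -> "legacy checkout: 2 items"
```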
Level 2: Dynamic In-Process Flags
Flags stored in a configuration file, database, or environment variable that can be changed at runtime without redeploying.
Pros: No redeployment needed. Supports percentage rollout. Simple to implement.
Cons: Each instance reads its own config - no centralized view. Limited targeting capabilities.
Best for: Teams that need gradual rollout but do not want to adopt a third-party service yet.
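One way a Level 2 flag might look, reading its rollout percentage from an environment variable and bucketing users with a stable hash so each user gets a consistent experience (the variable name and helpers are assumptions):

```python
# Level 2: a dynamic in-process flag with percentage rollout.
# The flag state comes from the environment (or a config file / database row),
# so it can change at runtime without a redeploy.
import hashlib
import os


def rollout_percentage(flag_name: str, default: int = 0) -> int:
    # e.g. export FLAG_NEW_CHECKOUT_PERCENT=25
    return int(os.environ.get(f"FLAG_{flag_name}_PERCENT", default))


def is_enabled(flag_name: str, user_id: str) -> bool:
    """Stable per-user decision: the same user always gets the same answer
    for a given percentage, so the experience does not flicker between requests."""
    bucket = int(hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < rollout_percentage(flag_name)


if __name__ == "__main__":
    os.environ["FLAG_NEW_CHECKOUT_PERCENT"] = "25"
    enabled = sum(is_enabled("NEW_CHECKOUT", f"user-{i}") for i in range(1000))
    print(f"~{enabled / 10:.0f}% of users see the new checkout")
```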
Level 3: Centralized Flag Service
A dedicated service (self-hosted or SaaS) that manages all flags, provides a dashboard, supports targeting rules, and tracks flag usage.
Examples: LaunchDarkly, Unleash, Flagsmith, Split, or a custom internal service.
Pros: Centralized management. Rich targeting (by user, plan, region, etc.). Audit trail. Real-time changes.
Cons: Added dependency. Cost (for SaaS). Network latency for flag evaluation (mitigated by local caching in most SDKs).
Best for: Teams at scale. Products with diverse user segments. Regulated environments needing audit trails.
Level 4: Infrastructure Routing
Instead of checking flags in application code, route traffic at the infrastructure level (load balancer, service mesh, API gateway).
Pros: No application code changes. Clean separation of routing from logic. Works across services.
Cons: Requires infrastructure investment. Less granular than application-level flags. Harder to target individual users.
Best for: Microservice architectures. Service-level rollouts. A/B testing at the infrastructure layer.
Feature Flag Lifecycle
Every feature flag has a lifecycle. Flags that are not actively managed become technical debt. Follow this lifecycle rigorously.
The Six Stages
1. CREATE → Define the flag, document its purpose and owner
2. DEPLOY OFF → Code ships to production with the flag disabled
3. BUILD → Incrementally add functionality behind the flag
4. DARK LAUNCH → Enable for internal users or a small test group
5. ROLLOUT → Gradually increase the percentage of users
6. REMOVE → Delete the flag and the old code path
Stage 1: Create
Before writing any code, define the flag (a sketch of a flag record follows this list):
- Name: Use a consistent naming convention (e.g., `enable-new-checkout`, `feature.discount-engine`)
- Owner: Who is responsible for this flag through its lifecycle?
- Purpose: One sentence describing what the flag controls
- Planned removal date: Set this at creation time. Flags without removal dates become permanent.
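A flag record could live next to the code as a small registry entry. The structure below is an illustrative sketch that mirrors the fields above, not a required format:

```python
# flag_registry.py - one entry per flag, reviewed in the monthly flag audit.
from dataclasses import dataclass
from datetime import date


@dataclass
class FlagDefinition:
    name: str              # consistent naming convention
    owner: str             # accountable through the whole lifecycle
    purpose: str           # one sentence
    planned_removal: date  # set at creation time


FLAGS = [
    FlagDefinition(
        name="enable-new-checkout",
        owner="checkout-team",
        purpose="Controls the rewritten checkout flow while it is built incrementally.",
        planned_removal=date(2025, 9, 30),  # example value
    ),
]
```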
Stage 2: Deploy OFF
The first deployment includes the flag check but the flag is disabled. This verifies that:
- The flag infrastructure works
- The default (off) path is unaffected
- The flag check does not introduce performance issues
Stage 3: Build Incrementally
Continue building the feature behind the flag over multiple deploys. Each deploy adds more functionality, but the flag remains off for users. Test both paths in your automated suite:
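A hedged sketch of what that might look like with pytest, parametrizing one behavior over both flag states (the function under test is a stand-in, not an API defined by this guide):

```python
# A pytest sketch that runs the same scenario with the flag off and on.
# "render_checkout" and the flag override are assumed stand-ins; adapt them
# to whatever flag helper your codebase actually uses.
import pytest


def render_checkout(cart: list[str], new_checkout_enabled: bool) -> str:
    prefix = "new" if new_checkout_enabled else "legacy"
    return f"{prefix} checkout: {len(cart)} items"


@pytest.mark.parametrize("flag_on", [False, True])
def test_checkout_item_count_shown_in_both_paths(flag_on):
    page = render_checkout(["book", "pen"], new_checkout_enabled=flag_on)
    # The behavior under test must hold whether or not the flag is enabled.
    assert "2 items" in page
```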
Stage 4: Dark Launch
Enable the flag for internal users or a specific test group. This is your first validation with real production data and real traffic patterns. Monitor:
- Error rates for the flagged group vs. control
- Performance metrics (latency, throughput)
- Business metrics (conversion, engagement)
Stage 5: Gradual Rollout
Increase exposure systematically:
| Step | Audience | Duration | What to Watch |
| --- | --- | --- | --- |
| 1 | 1% of users | 1-2 hours | Error rates, latency |
| 2 | 5% of users | 4-8 hours | Performance at slightly higher load |
| 3 | 25% of users | 1 day | Business metrics begin to be meaningful |
| 4 | 50% of users | 1-2 days | Statistically significant business impact |
| 5 | 100% of users | - | Full rollout |
At any step, if metrics degrade, roll back by disabling the flag. No redeployment needed.
Stage 6: Remove
This is the most commonly skipped step, and skipping it creates significant technical debt.
Once the feature has been stable at 100% for an agreed period (e.g., 2 weeks):
- Remove the flag check from code
- Remove the old code path
- Remove the flag definition from the flag service
- Deploy the simplified code
Set a maximum flag lifetime. A common practice is 90 days. Any flag older than 90 days triggers an automatic review. Stale flags are a maintenance burden and a source of confusion.
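The age check can be automated. A sketch of a CI job that could fail the build when any flag exceeds the agreed lifetime (the registry shape and the 90-day threshold are assumptions based on the text above):

```python
# Sketch: fail the build (or open a ticket) when a flag outlives its welcome.
from datetime import date, timedelta

MAX_FLAG_AGE_DAYS = 90  # team policy; adjust to your agreed lifetime

# In practice this would come from your flag service or registry module.
FLAG_CREATION_DATES = {
    "enable-new-checkout": date(2025, 1, 15),
    "feature.discount-engine": date(2024, 10, 1),
}


def stale_flags(today: date | None = None) -> list[str]:
    today = today or date.today()
    cutoff = today - timedelta(days=MAX_FLAG_AGE_DAYS)
    return [name for name, created in FLAG_CREATION_DATES.items() if created < cutoff]


if __name__ == "__main__":
    overdue = stale_flags()
    if overdue:
        raise SystemExit(f"Flags overdue for review/removal: {', '.join(overdue)}")
    print("No stale flags.")
```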
Key Pitfalls
1. “We have 200 feature flags and nobody knows what they all do”
This is flag debt, and it is as damaging as any other technical debt. Prevent it by enforcing the lifecycle: every flag has an owner, a purpose, and a removal date. Run a monthly flag audit.
2. “We use flags for everything, including configuration”
Feature flags and configuration are different concerns. Flags are temporary (they control unreleased features). Configuration is permanent (it controls operational behavior like timeouts, connection pools, log levels). Mixing them leads to confusion about what can be safely removed.
3. “Testing both paths doubles our test burden”
It does increase test effort, but this is a temporary cost. When the flag is removed, the extra tests go away too. The alternative - deploying untested code paths - is far more expensive.
4. “Nested flags create combinatorial complexity”
Avoid nesting flags whenever possible. If feature B depends on feature A, do not create a separate flag for B. Instead, extend the behavior behind feature A’s flag. If you must nest, document the dependency and test the specific combinations that matter.
Measuring Success
| Metric | Target | Why It Matters |
| --- | --- | --- |
| Active flag count | Stable or decreasing | Confirms flags are being removed, not accumulating |
| Average flag age | < 90 days | Catches stale flags before they become permanent |
| Flag-related incidents | Near zero | Confirms flag management is not causing problems |
| Time from deploy to release | Hours to days (not weeks) | Confirms flags enable fast, controlled releases |
Next Step
Small batches and feature flags let you deploy more frequently, but deploying more means more work in progress. Limiting WIP ensures that increased deploy frequency does not create chaos.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
3.4.3 - Limiting Work in Progress
Focus on finishing work over starting new work to improve flow and reduce cycle time.
Phase 3 - Optimize | Adapted from Dojo Consortium
Work in progress (WIP) is inventory. Like physical inventory, it loses value the longer it sits unfinished. Limiting WIP is the most counterintuitive and most impactful practice in this entire migration: doing less work at once makes you deliver more.
Why Limiting WIP Matters
Every item of work in progress has a cost:
- Context switching: Moving between tasks destroys focus. Research consistently shows that switching between two tasks reduces productive time by 20-40%.
- Delayed feedback: Work that is started but not finished cannot be validated by users. The longer it sits, the more assumptions go untested.
- Hidden dependencies: The more items in progress simultaneously, the more likely they are to conflict, block each other, or require coordination.
- Longer cycle time: Little's Law states that cycle time = WIP / throughput. If throughput is constant, the only way to reduce cycle time is to reduce WIP. For example, a team with 8 items in progress and a throughput of 2 items per day has an average cycle time of 4 days; cutting WIP to 4 items halves it to 2 days.
“Stop starting, start finishing.”
How to Set Your WIP Limit
The N+2 Starting Point
A practical starting WIP limit for a team is N+2, where N is the number of team members actively working on delivery.
| Team Size | Starting WIP Limit | Rationale |
| --- | --- | --- |
| 3 developers | 5 items | Allows one item per person plus a small buffer |
| 5 developers | 7 items | Same principle at larger scale |
| 8 developers | 10 items | Buffer becomes proportionally smaller |
Why N+2 and not N? Because some items will be blocked waiting for review, testing, or external dependencies. A small buffer prevents team members from being idle when their primary task is blocked. But the buffer should be small - two items, not ten.
Continuously Lower the Limit
The N+2 formula is a starting point, not a destination. Once the team is comfortable with the initial limit, reduce it:
- Start at N+2. Run for 2-4 weeks. Observe where work gets stuck.
- Reduce to N+1. Tighten the limit. Some team members will occasionally be “idle” - this is a feature, not a bug. They should swarm on blocked items.
- Reduce to N. At this point, every team member is working on exactly one thing. Blocked work gets immediate attention because someone is always available to help.
- Consider going below N. Some teams find that pairing (two people, one item) further reduces cycle time. A team of 6 with a WIP limit of 3 means everyone is pairing.
Each reduction will feel uncomfortable. That discomfort is the point - it exposes problems in your workflow that were previously hidden by excess WIP.
What Happens When You Hit the Limit
When the team reaches its WIP limit and someone finishes a task, they have two options:
- Pull the next highest-priority item (if the WIP limit allows it).
- Swarm on an existing item that is blocked, stuck, or nearing its cycle time target.
When the WIP limit is reached and no items are complete:
- Do not start new work. This is the hardest part and the most important.
- Help unblock existing work. Pair with someone. Review a pull request. Write a missing test. Talk to the person who has the answer to the blocking question.
- Improve the process. If nothing is blocked but everything is slow, this is the time to work on automation, tooling, or documentation.
Swarming
Swarming is the practice of multiple team members working together on a single item to get it finished faster. It is the natural complement to WIP limits.
When to Swarm
- An item has been in progress for longer than the team’s cycle time target (e.g., more than 2 days)
- An item is blocked and the blocker can be resolved by another team member
- The WIP limit is reached and someone needs work to do
- A critical defect needs to be fixed immediately
How to Swarm Effectively
| Approach | How It Works | Best For |
| --- | --- | --- |
| Pair programming | Two developers work on the same item at the same machine | Complex logic, knowledge transfer, code that needs review |
| Mob programming | The whole team works on one item together | Critical path items, complex architectural decisions |
| Divide and conquer | Break the item into sub-tasks and assign them | Items that can be parallelized (e.g., frontend + backend + tests) |
| Unblock and return | One person resolves the blocker, then hands back | External dependencies, environment issues, access requests |
Why Teams Resist Swarming
The most common objection: “It’s inefficient to have two people on one task.” This is only true if you measure efficiency as “percentage of time each person is writing new code.” If you measure efficiency as “how quickly value reaches production,” swarming is almost always faster because it reduces handoffs, wait time, and rework.
How Limiting WIP Exposes Workflow Issues
One of the most valuable effects of WIP limits is that they make hidden problems visible. When you cannot start new work, you are forced to confront the problems that slow existing work down.
| Symptom When WIP Is Limited | Root Cause Exposed |
| --- | --- |
| “I’m idle because my PR is waiting for review” | Code review process is too slow |
| “I’m idle because I’m waiting for the test environment” | Not enough environments, or environments are not self-service |
| “I’m idle because I’m waiting for the product owner to clarify requirements” | Stories are not refined before being pulled into the sprint |
| “I’m idle because my build is broken and I can’t figure out why” | Build is not deterministic, or test suite is flaky |
| “I’m idle because another team hasn’t finished the API I depend on” | Architecture is too tightly coupled (see Architecture Decoupling) |
Each of these is a bottleneck that was previously invisible because the team could always start something else. With WIP limits, these bottlenecks become obvious and demand attention.
Implementing WIP Limits
Step 1: Make WIP Visible (Week 1)
Before setting limits, make current WIP visible:
- Count the number of items currently “in progress” for the team
- Write this number on the board (physical or digital) every day
- Most teams are shocked by how high it is. A team of 5 often has 15-20 items in progress.
Step 2: Set the Initial Limit (Week 2)
- Calculate N+2 for your team
- Add the limit to your board (e.g., a column header that says “In Progress (limit: 7)”)
- Agree as a team that when the limit is reached, no new work starts
Step 3: Enforce the Limit (Week 3+)
- When someone tries to pull new work and the limit is reached, the team helps them find an existing item to work on
- Track violations: how often does the team exceed the limit? What causes it?
- Discuss in retrospectives: Is the limit too high? Too low? What bottlenecks are exposed?
Step 4: Reduce the Limit (Monthly)
- Every month, consider reducing the limit by 1
- Each reduction will expose new bottlenecks - this is the intended effect
- Stop reducing when the team reaches a sustainable flow where items move from start to done predictably
Key Pitfalls
1. “We set a WIP limit but nobody enforces it”
A WIP limit that is not enforced is not a WIP limit. Enforcement requires a team agreement and a visible mechanism. If the board shows 10 items in progress and the limit is 7, the team should stop and address it immediately. This is a working agreement, not a suggestion.
2. “Developers are idle and management is uncomfortable”
This is the most common failure mode. Management sees “idle” developers and concludes WIP limits are wasteful. In reality, those “idle” developers are either swarming on existing work (which is productive) or the team has hit a genuine bottleneck that needs to be addressed. The discomfort is a signal that the system needs improvement.
3. “We have WIP limits but we also have expedite lanes for everything”
If every urgent request bypasses the WIP limit, you do not have a WIP limit. Expedite lanes should be rare - one per week at most. If everything is urgent, nothing is.
4. “We limit WIP per person but not per team”
Per-person WIP limits miss the point. The goal is to limit team WIP so that team members are incentivized to help each other. A per-person limit of 1 with no team limit still allows the team to have 8 items in progress simultaneously with no swarming.
Measuring Success
| Metric | Target | Why It Matters |
| --- | --- | --- |
| Work in progress | At or below team limit | Confirms the limit is being respected |
| Development cycle time | Decreasing | Confirms that less WIP leads to faster delivery |
| Items completed per week | Stable or increasing | Confirms that finishing more, starting less works |
| Time items spend blocked | Decreasing | Confirms bottlenecks are being addressed |
Next Step
WIP limits expose problems. Metrics-Driven Improvement provides the framework for systematically addressing them.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
3.4.4 - Metrics-Driven Improvement
Use DORA metrics and improvement kata to drive systematic delivery improvement.
Phase 3 - Optimize | Original content combining DORA recommendations and improvement kata
Improvement without measurement is guesswork. This page combines the DORA four key metrics with the improvement kata pattern to create a systematic, repeatable approach to getting better at delivery.
The Problem with Ad Hoc Improvement
Most teams improve accidentally. Someone reads a blog post, suggests a change at standup, and the team tries it for a week before forgetting about it. This produces sporadic, unmeasurable progress that is impossible to sustain.
Metrics-driven improvement replaces this with a disciplined cycle: measure where you are, define where you want to be, run a small experiment, measure the result, and repeat. The improvement kata provides the structure. DORA metrics provide the measures.
The Four DORA Metrics
The DORA research program (now part of Google Cloud) has identified four key metrics that predict software delivery performance. These are the metrics you should track throughout your CD migration.
1. Deployment Frequency
How often your team deploys to production.
| Performance Level | Deployment Frequency |
| --- | --- |
| Elite | On-demand (multiple deploys per day) |
| High | Between once per day and once per week |
| Medium | Between once per week and once per month |
| Low | Between once per month and once every six months |
What it tells you: How comfortable your team and pipeline are with deploying. Low frequency usually indicates manual gates, fear of deployment, or large batch sizes.
How to measure: Count the number of successful deployments to production per unit of time. Automated deploys count. Hotfixes count. Rollbacks do not.
2. Lead Time for Changes
The time from a commit being pushed to trunk to that commit running in production.
| Performance Level | Lead Time |
| --- | --- |
| Elite | Less than one hour |
| High | Between one day and one week |
| Medium | Between one week and one month |
| Low | Between one month and six months |
What it tells you: How efficient your pipeline is. Long lead times indicate slow builds, manual approval steps, or infrequent deployment windows.
How to measure: Record the timestamp when a commit merges to trunk and the timestamp when that commit is running in production. The difference is lead time. Track the median, not the mean (outliers distort the mean).
3. Change Failure Rate
The percentage of deployments that cause a failure in production requiring remediation (rollback, hotfix, or patch).
| Performance Level | Change Failure Rate |
| --- | --- |
| Elite | 0-15% |
| High | 16-30% |
| Medium | 16-30% |
| Low | 46-60% |
What it tells you: How effective your testing and validation pipeline is. High failure rates indicate gaps in test coverage, insufficient pre-production validation, or overly large changes.
How to measure: Track deployments that result in a degraded service, require rollback, or need a hotfix. Divide by total deployments. A “failure” is defined by the team - typically any incident that requires immediate human intervention.
4. Mean Time to Restore (MTTR)
How long it takes to recover from a failure in production.
| Performance Level | Time to Restore |
| --- | --- |
| Elite | Less than one hour |
| High | Less than one day |
| Medium | Less than one day |
| Low | Between one week and one month |
What it tells you: How resilient your system and team are. Long recovery times indicate manual rollback processes, poor observability, or insufficient incident response practices.
How to measure: Record the timestamp when a production failure is detected and the timestamp when service is fully restored. Track the median.
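A sketch of how all four metrics could be computed from a team's own deployment and incident records; the record shapes and field names are assumptions, since the guide does not prescribe a data source:

```python
# Sketch: compute the four DORA metrics from simple event records.
# The record shapes (dicts with ISO timestamps) are illustrative assumptions.
from datetime import datetime
from statistics import median

deployments = [
    # merged_at = commit merged to trunk, deployed_at = running in production
    {"merged_at": "2025-03-03T09:00", "deployed_at": "2025-03-03T10:30", "failed": False},
    {"merged_at": "2025-03-04T11:00", "deployed_at": "2025-03-05T09:00", "failed": True},
    {"merged_at": "2025-03-06T14:00", "deployed_at": "2025-03-06T15:00", "failed": False},
]
incidents = [
    {"detected_at": "2025-03-05T09:20", "restored_at": "2025-03-05T10:05"},
]


def hours_between(start: str, end: str) -> float:
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 3600


window_weeks = 1  # the sample records above span one week
deploy_frequency = len(deployments) / window_weeks
lead_time_hours = median(hours_between(d["merged_at"], d["deployed_at"]) for d in deployments)
change_failure_rate = sum(d["failed"] for d in deployments) / len(deployments)
restore_hours = median(hours_between(i["detected_at"], i["restored_at"]) for i in incidents)

print(f"deploy frequency:    {deploy_frequency:.1f}/week")
print(f"lead time (median):  {lead_time_hours:.1f} h")
print(f"change failure rate: {change_failure_rate:.0%}")
print(f"time to restore:     {restore_hours:.1f} h (median)")
```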
The DORA Capabilities
Behind these four metrics are 24 capabilities that the DORA research has shown to drive performance. They organize into five categories. Use this as a diagnostic tool: when a metric is lagging, look at the related capabilities to identify what to improve.
Continuous Delivery Capabilities
These directly affect your pipeline and deployment practices:
- Version control for all production artifacts
- Automated deployment processes
- Continuous integration
- Trunk-based development
- Test automation
- Test data management
- Shift-left security
- Continuous delivery (the ability to deploy at any time)
Architecture Capabilities
These affect how easily your system can be changed and deployed:
- Loosely coupled architecture
- Empowered teams that can choose their own tools
- Teams that can test, deploy, and release independently
Product and Process Capabilities
These affect how work flows through the team:
- Customer feedback loops
- Value stream visibility
- Working in small batches
- Team experimentation
Lean Management Capabilities
These affect how the organization supports delivery:
- Lightweight change approval processes
- Monitoring and observability
- Proactive notification
- WIP limits
- Visual management of workflow
Cultural Capabilities
These affect the environment in which teams operate:
- Generative organizational culture (Westrum model)
- Encouraging and supporting learning
- Collaboration within and between teams
- Job satisfaction
- Transformational leadership
For a detailed breakdown, see the DORA Capabilities reference.
The Improvement Kata
The improvement kata is a four-step pattern from lean manufacturing adapted for software delivery. It provides the structure for turning DORA measurements into concrete improvements.
Step 1: Understand the Direction
Where does your CD migration need to go?
This is already defined by the phases of this migration guide. In Phase 3, your direction is: smaller batches, faster flow, and higher confidence in every deployment.
Step 2: Grasp the Current Condition
Measure your current DORA metrics. Be honest - the point is to understand reality, not to look good.
Practical approach:
- Collect two weeks of data for all four DORA metrics
- Plot the data - do not just calculate averages. Look at the distribution.
- Identify which metric is furthest from your target
- Investigate the related capabilities to understand why
Example current condition:
| Metric | Current | Target | Gap |
| --- | --- | --- | --- |
| Deployment frequency | Weekly | Daily | 5x improvement needed |
| Lead time | 3 days | < 1 day | Pipeline is slow or has manual gates |
| Change failure rate | 25% | < 15% | Test coverage or change size issue |
| MTTR | 4 hours | < 1 hour | Rollback is manual |
Step 3: Establish the Next Target Condition
Do not try to fix everything at once. Pick one metric and define a specific, measurable, time-bound target.
Good target: “Reduce lead time from 3 days to 1 day within the next 4 weeks.”
Bad target: “Improve our deployment pipeline.” (Too vague, no measure, no deadline.)
Step 4: Experiment Toward the Target
Design a small experiment that you believe will move the metric toward the target. Run it. Measure the result. Adjust.
The experiment format:
| Element | Description |
| --- | --- |
| Hypothesis | “If we [action], then [metric] will [improve/decrease] because [reason].” |
| Action | What specifically will you change? |
| Duration | How long will you run the experiment? (Typically 1-2 weeks) |
| Measure | How will you know if it worked? |
| Decision criteria | What result would cause you to keep, modify, or abandon the change? |
Example experiment:
Hypothesis: If we parallelize our integration test suite, lead time will drop from 3 days to under 2 days because 60% of lead time is spent waiting for tests to complete.
Action: Split the integration test suite into 4 parallel runners.
Duration: 2 weeks.
Measure: Median lead time for commits merged during the experiment period.
Decision criteria: Keep if lead time drops below 2 days. Modify if it drops but not enough. Abandon if it has no effect or introduces flakiness.
The Cycle Repeats
After each experiment:
- Measure the result
- Update your understanding of the current condition
- If the target is met, pick the next metric to improve
- If the target is not met, design another experiment
This creates a continuous improvement loop. Each cycle takes 1-2 weeks. Over months, the cumulative effect is dramatic.
Connecting Metrics to Action
When a metric is lagging, use this guide to identify where to focus.
Low Deployment Frequency
| Possible Cause | Investigation | Action |
| --- | --- | --- |
| Manual approval gates | Map the approval chain | Automate or eliminate non-value-adding approvals |
| Fear of deployment | Ask the team what they fear | Address the specific fear (usually testing gaps) |
| Large batch size | Measure changes per deploy | Implement small batches practices |
| Deploy process is manual | Time the deploy process | Automate the deployment pipeline |
Long Lead Time
| Possible Cause | Investigation | Action |
| --- | --- | --- |
| Slow builds | Time each pipeline stage | Optimize the slowest stage (often tests) |
| Waiting for environments | Track environment wait time | Implement self-service environments |
| Waiting for approval | Track approval wait time | Reduce approval scope or automate |
| Large changes | Measure commit size | Reduce batch size |
High Change Failure Rate
| Possible Cause | Investigation | Action |
| --- | --- | --- |
| Insufficient test coverage | Measure coverage by area | Add tests for the areas that fail most |
| Tests pass but production differs | Compare test and prod environments | Make environments more production-like |
| Large, risky changes | Measure change size | Reduce batch size, use feature flags |
| Configuration drift | Audit configuration differences | Externalize and version configuration |
Long MTTR
| Possible Cause | Investigation | Action |
| --- | --- | --- |
| Rollback is manual | Time the rollback process | Automate rollback |
| Hard to identify root cause | Review recent incidents | Improve observability and alerting |
| Hard to deploy fixes quickly | Measure fix lead time | Ensure pipeline supports rapid hotfix deployment |
| Dependencies fail in cascade | Map failure domains | Improve architecture decoupling |
Building a Metrics Dashboard
Make your DORA metrics visible to the team at all times. A dashboard on a wall monitor or a shared link is ideal.
Essential elements:
- Current values for all four DORA metrics
- Trend lines showing direction over the past 4-8 weeks
- Current target condition highlighted
- Active experiment description
Keep it simple. A spreadsheet updated weekly is better than a sophisticated dashboard that nobody maintains. The goal is visibility, not tooling sophistication.
Key Pitfalls
1. “We measure but don’t act”
Measurement without action is waste. If you collect metrics but never run experiments, you are creating overhead with no benefit. Every measurement should lead to a hypothesis. Every hypothesis should lead to an experiment.
2. “We use metrics to compare teams”
DORA metrics are for teams to improve themselves, not for management to rank teams. Using metrics for comparison creates incentives to game the numbers. Each team should own its own metrics and its own improvement targets.
3. “We try to improve all four metrics at once”
Focus on one metric at a time. Improving deployment frequency and change failure rate simultaneously often requires conflicting actions. Pick the biggest bottleneck, address it, then move to the next.
4. “We abandon experiments too quickly”
Most experiments need at least two weeks to show results. One bad day is not a reason to abandon an experiment. Set the duration up front and commit to it.
Measuring Success
| Indicator | Target | Why It Matters |
| --- | --- | --- |
| Experiments per month | 2-4 | Confirms the team is actively improving |
| Metrics trending in the right direction | Consistent improvement over 3+ months | Confirms experiments are having effect |
| Team can articulate current condition and target | Everyone on the team knows | Confirms improvement is a shared concern |
| Improvement items in backlog | Always present | Confirms improvement is treated as a deliverable |
Next Step
Metrics tell you what to improve. Retrospectives provide the team forum for deciding how to improve it.
3.4.5 - Retrospectives
Continuously improve the delivery process through structured reflection.
Phase 3 - Optimize | Adapted from Dojo Consortium
A retrospective is the team’s primary mechanism for turning observations into improvements. Without effective retrospectives, WIP limits expose problems that nobody addresses, metrics trend in the wrong direction with no response, and the CD migration stalls.
Why Retrospectives Matter for CD Migration
Every practice in this guide - trunk-based development, small batches, WIP limits, metrics-driven improvement - generates signals about what is working and what is not. Retrospectives are where the team processes those signals and decides what to change.
Teams that skip retrospectives or treat them as a checkbox exercise consistently stall at whatever maturity level they first reach. Teams that run effective retrospectives continuously improve, week after week, month after month.
The Five-Part Structure
An effective retrospective follows a structured format that prevents it from devolving into a venting session or a status meeting. This five-part structure ensures the team moves from observation to action.
Part 1: Review the Mission (5 minutes)
Start by reminding the team of the larger goal. In the context of a CD migration, this might be:
- “Our mission this quarter is to deploy to production at least once per day.”
- “We are working toward eliminating manual gates in our pipeline.”
- “Our goal is to reduce lead time from 3 days to under 1 day.”
This grounding prevents the retrospective from focusing on minor irritations and keeps the conversation aligned with what matters.
Part 2: Review the KPIs (10 minutes)
Present the team’s current metrics. For a CD migration, these are typically the DORA metrics plus any team-specific measures from Metrics-Driven Improvement.
| Metric | Last Period | This Period | Trend |
| --- | --- | --- | --- |
| Deployment frequency | 3/week | 4/week | Improving |
| Lead time (median) | 2.5 days | 2.1 days | Improving |
| Change failure rate | 22% | 18% | Improving |
| MTTR | 3 hours | 3.5 hours | Declining |
| WIP (average) | 8 items | 6 items | Improving |
Do not skip this step. Without data, the retrospective becomes a subjective debate where the loudest voice wins. With data, the conversation focuses on what the numbers show and what to do about them.
Part 3: Review Experiments (10 minutes)
Review the outcomes of any experiments the team ran since the last retrospective.
For each experiment:
- What was the hypothesis? Remind the team what you were testing.
- What happened? Present the data.
- What did you learn? Even failed experiments teach you something.
- What is the decision? Keep, modify, or abandon.
Example:
Experiment: Parallelize the integration test suite to reduce lead time.
Hypothesis: Lead time would drop from 2.5 days to under 2 days.
Result: Lead time dropped to 2.1 days. The parallelization worked, but environment setup time is now the bottleneck.
Decision: Keep the parallelization. New experiment: investigate self-service test environments.
Part 4: Check Goals (10 minutes)
Review any improvement goals or action items from the previous retrospective.
- Completed: Acknowledge and celebrate. This is important - it reinforces that improvement work matters.
- In progress: Check for blockers. Does the team need to adjust the approach?
- Not started: Why not? Was it deprioritized, blocked, or forgotten? If improvement work is consistently not started, the team is not treating improvement as a deliverable (see below).
Part 5: Open Conversation (25 minutes)
This is the core of the retrospective. The team discusses:
- What is working well that we should keep doing?
- What is not working that we should change?
- What new problems or opportunities have we noticed?
Facilitation techniques for this section:
| Technique | How It Works | Best For |
| --- | --- | --- |
| Start/Stop/Continue | Each person writes items in three categories | Quick, structured, works with any team |
| 4Ls (Liked, Learned, Lacked, Longed For) | Broader categories that capture emotional responses | Teams that need to process frustration or celebrate wins |
| Timeline | Plot events on a timeline and discuss turning points | After a particularly eventful sprint or incident |
| Dot voting | Everyone gets 3 votes to prioritize discussion topics | When there are many items and limited time |
From Conversation to Commitment
The open conversation must produce concrete action items. Vague commitments like “we should communicate better” are worthless. Good action items are:
- Specific: “Add a Slack notification when the build breaks” (not “improve communication”)
- Owned: “Alex will set this up by Wednesday” (not “someone should do this”)
- Measurable: “We will know this worked if build break response time drops below 10 minutes”
- Time-bound: “We will review the result at the next retrospective”
Limit action items to 1-3 per retrospective. More than three means nothing gets done. One well-executed improvement is worth more than five abandoned ones.
Psychological Safety Is a Prerequisite
A retrospective only works if team members feel safe to speak honestly about what is not working. Without psychological safety, retrospectives produce sanitized, non-actionable discussion.
Signs of Low Psychological Safety
- Only senior team members speak
- Nobody mentions problems - everything is “fine”
- Issues that everyone knows about are never raised
- Team members vent privately after the retrospective instead of during it
- Action items are always about tools or processes, never about behaviors
Building Psychological Safety
| Practice | Why It Helps |
| --- | --- |
| Leader speaks last | Prevents the leader’s opinion from anchoring the discussion |
| Anonymous input | Use sticky notes or digital tools where input is anonymous initially |
| Blame-free language | “The deploy failed” not “You broke the deploy” |
| Follow through on raised issues | Nothing destroys safety faster than raising a concern and having it ignored |
| Acknowledge mistakes openly | Leaders who admit their own mistakes make it safe for others to do the same |
| Separate retrospective from performance review | If retro content affects reviews, people will not be honest |
Treat Improvement as a Deliverable
The most common failure mode for retrospectives is producing action items that never get done. This happens when improvement work is treated as something to do “when we have time” - which means never.
Make Improvement Visible
- Add improvement items to the same board as feature work
- Include improvement items in WIP limits
- Track improvement items through the same workflow as any other deliverable
Allocate Capacity
Reserve a percentage of team capacity for improvement work. Common allocations:
| Allocation | Approach |
| --- | --- |
| 20% continuous | One day per week (or equivalent) dedicated to improvement, tooling, and tech debt |
| Dedicated improvement sprint | Every 4th sprint is entirely improvement-focused |
| Improvement as first pull | When someone finishes work and the WIP limit allows, the first option is an improvement item |
The specific allocation matters less than having one. A team that explicitly budgets 10% for improvement will improve more than a team that aspires to 20% but never protects the time.
Retrospective Cadence
| Cadence | Best For | Caution |
| --- | --- | --- |
| Weekly | Teams in active CD migration, teams working through major changes | Can feel like too many meetings if not well-facilitated |
| Bi-weekly | Teams in steady state with ongoing improvement | Most common cadence |
| After incidents | Any team | Incident retrospectives (postmortems) are separate from regular retrospectives |
| Monthly | Mature teams with well-established improvement habits | Too infrequent for teams early in their migration |
During active phases of a CD migration (Phases 1-3), weekly retrospectives are recommended. Once the team reaches Phase 4, bi-weekly is usually sufficient.
Running Your First CD Migration Retrospective
If your team has not been running effective retrospectives, start here:
Before the Retrospective
- Collect your DORA metrics for the past two weeks
- Review any action items from the previous retrospective (if applicable)
- Prepare a shared document or board with the five-part structure
During the Retrospective (60 minutes)
- Review mission (5 min): State your CD migration goal for this phase
- Review KPIs (10 min): Present the DORA metrics. Ask: “What do you notice?”
- Review experiments (10 min): Discuss any experiments that were run
- Check goals (10 min): Review action items from last time
- Open conversation (25 min): Use Start/Stop/Continue for the first time - it is the simplest format
After the Retrospective
- Publish the action items where the team will see them daily
- Assign owners and due dates
- Add improvement items to the team board
- Schedule the next retrospective
Key Pitfalls
1. “Our retrospectives always produce the same complaints”
If the same issues surface repeatedly, the team is not executing on its action items. Check whether improvement work is being prioritized alongside feature work. If it is not, no amount of retrospective technique will help.
2. “People don’t want to attend because nothing changes”
This is a symptom of the same problem - action items are not executed. The fix is to start small: commit to one action item per retrospective, execute it completely, and demonstrate the result at the next retrospective. Success builds momentum.
3. “The retrospective turns into a blame session”
The facilitator must enforce blame-free language. Redirect “You did X wrong” to “When X happened, the impact was Y. How can we prevent Y?” If blame is persistent, the team has a psychological safety problem that needs to be addressed separately.
4. “We don’t have time for retrospectives”
A team that does not have time to improve will never improve. A 60-minute retrospective that produces one executed improvement is the highest-leverage hour of the entire sprint.
Measuring Success
| Indicator | Target | Why It Matters |
| --- | --- | --- |
| Retrospective attendance | 100% of team | Confirms the team values the practice |
| Action items completed | > 80% completion rate | Confirms improvement is treated as a deliverable |
| DORA metrics trend | Improving quarter over quarter | Confirms retrospectives lead to real improvement |
| Team engagement | Voluntary contributions increasing | Confirms psychological safety is present |
Next Step
With metrics-driven improvement and effective retrospectives, you have the engine for continuous improvement. The final optimization step is Architecture Decoupling - ensuring your system’s architecture does not prevent you from deploying independently.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
3.4.6 - Architecture Decoupling
Enable independent deployment of components by decoupling architecture boundaries.
Phase 3 - Optimize | Original content based on Dojo Consortium delivery journey patterns
You cannot deploy independently if your architecture requires coordinated releases. This page describes the three architecture states teams encounter on the journey to continuous deployment and provides practical strategies for moving from entangled to loosely coupled.
Why Architecture Matters for CD
Every practice in this guide - small batches, feature flags, WIP limits - assumes that your team can deploy its changes independently. But if your application is a monolith where changing one module requires retesting everything, or a set of microservices with tightly coupled APIs, independent deployment is impossible regardless of how good your practices are.
Architecture is either an enabler or a blocker for continuous deployment. There is no neutral.
Three Architecture States
The Delivery System Improvement Journey describes three states that teams move through. Most teams start entangled. The goal is to reach loosely coupled.
State 1: Entangled
In an entangled architecture, everything is connected to everything. Changes in one area routinely break other areas. Teams cannot deploy independently.
Characteristics:
- Shared database schemas with no ownership boundaries
- Circular dependencies between modules or services
- Deploying one service requires deploying three others at the same time
- Integration testing requires the entire system to be running
- A single team’s change can block every other team’s release
- “Big bang” releases on a fixed schedule
Impact on delivery:
| Metric | Typical State |
| --- | --- |
| Deployment frequency | Monthly or quarterly (because coordinating releases is hard) |
| Lead time | Weeks to months (because changes wait for the next release train) |
| Change failure rate | High (because big releases mean big risk) |
| MTTR | Long (because failures cascade across boundaries) |
How you got here: Entanglement is the natural result of building quickly without deliberate architectural boundaries. It is not a failure - it is a stage that almost every system passes through.
State 2: Tightly Coupled
In a tightly coupled architecture, there are identifiable boundaries between components, but those boundaries are leaky. Teams have some independence, but coordination is still required for many changes.
Characteristics:
- Services exist but share a database or use synchronous point-to-point calls
- API contracts exist but are not versioned - breaking changes require simultaneous updates
- Teams can deploy some changes independently, but cross-cutting changes require coordination
- Integration testing requires multiple services but not the entire system
- Release trains still exist but are smaller and more frequent
Impact on delivery:
| Metric | Typical State |
| --- | --- |
| Deployment frequency | Weekly to bi-weekly |
| Lead time | Days to a week |
| Change failure rate | Moderate (improving but still affected by coupling) |
| MTTR | Hours (failures are more isolated but still cascade sometimes) |
State 3: Loosely Coupled
In a loosely coupled architecture, components communicate through well-defined interfaces, own their own data, and can be deployed independently without coordinating with other teams.
Characteristics:
- Each service owns its own data store - no shared databases
- APIs are versioned; consumers and producers can be updated independently
- Asynchronous communication (events, queues) is used where possible
- Each team can deploy without coordinating with any other team
- Services are designed to degrade gracefully if a dependency is unavailable
- No release trains - each team deploys when ready
Impact on delivery:
| Metric | Typical State |
| --- | --- |
| Deployment frequency | On-demand (multiple times per day) |
| Lead time | Hours |
| Change failure rate | Low (small, isolated changes) |
| MTTR | Minutes (failures are contained within service boundaries) |
Moving from Entangled to Tightly Coupled
This is the first and most difficult transition. It requires establishing boundaries where none existed before.
Strategy 1: Identify Natural Seams
Look for places where the system already has natural boundaries, even if they are not enforced:
- Different business domains: Orders, payments, inventory, and user accounts are different domains even if they live in the same codebase.
- Different rates of change: Code that changes weekly and code that changes yearly should not be in the same deployment unit.
- Different scaling needs: Components with different load profiles benefit from separate deployment.
- Different team ownership: If different teams work on different parts of the codebase, those parts are candidates for separation.
Strategy 2: Strangler Fig Pattern
Instead of rewriting the system, incrementally extract components from the monolith.
Step 1: Route all traffic through a facade/proxy
Step 2: Build the new component alongside the old
Step 3: Route a small percentage of traffic to the new component
Step 4: Validate correctness and performance
Step 5: Route all traffic to the new component
Step 6: Remove the old code
Key rule: The strangler fig pattern must be done incrementally. If you try to extract everything at once, you are doing a rewrite, not a strangler fig.
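As an illustration of step 3, a thin facade can route a small, configurable share of traffic to the extracted component while everything else keeps hitting the monolith (handler names and the percentage source are assumptions):

```python
# Sketch of the routing step in a strangler fig migration: a thin facade
# decides, per request, whether to call the old monolith code path or the
# newly extracted component. Handler names are illustrative.
import hashlib

NEW_COMPONENT_TRAFFIC_PERCENT = 5  # raise gradually as confidence grows


def handle_with_monolith(order_id: str) -> str:
    return f"monolith handled order {order_id}"


def handle_with_new_service(order_id: str) -> str:
    return f"orders-service handled order {order_id}"


def handle_order(order_id: str) -> str:
    # Stable bucketing: the same order always takes the same path, which
    # makes results comparable while both implementations are live.
    bucket = int(hashlib.sha256(order_id.encode()).hexdigest(), 16) % 100
    if bucket < NEW_COMPONENT_TRAFFIC_PERCENT:
        return handle_with_new_service(order_id)
    return handle_with_monolith(order_id)


print(handle_order("order-1234"))
```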
Strategy 3: Define Ownership Boundaries
Assign clear ownership of each module or component to a single team. Ownership means:
- The owning team decides the API contract
- The owning team deploys the component
- Other teams consume the API, not the internal implementation
- Changes to the API contract require agreement from consumers (but not simultaneous deployment)
What to Avoid
- The “big rewrite”: Rewriting a monolith from scratch almost always fails. Use the strangler fig pattern instead.
- Premature microservices: Do not split into microservices until you have clear domain boundaries and team ownership. Microservices with unclear boundaries are a distributed monolith - the worst of both worlds.
- Shared databases across services: This is the most common coupling mechanism. If two services share a database, they cannot be deployed independently because a schema change in one service can break the other.
Moving from Tightly Coupled to Loosely Coupled
This transition is about hardening the boundaries that were established in the previous step.
Strategy 1: Eliminate Shared Data Stores
If two services share a database, one of three things needs to happen:
- One service owns the data, the other calls its API. The dependent service no longer accesses the database directly.
- The data is duplicated. Each service maintains its own copy, synchronized via events.
- The shared data becomes a dedicated data service. Both services consume from a service that owns the data.
BEFORE (shared database):
Service A → [Shared DB] ← Service B
AFTER (option 1 - API ownership):
Service A → [DB A]
Service B → Service A API → [DB A]
AFTER (option 2 - event-driven duplication):
Service A → [DB A] → Events → Service B → [DB B]
AFTER (option 3 - data service):
Service A → Data Service → [DB]
Service B → Data Service → [DB]
Strategy 2: Version Your APIs
API versioning allows consumers and producers to evolve independently.
Rules for API versioning (a versioned-route sketch follows this list):
- Never make a breaking change without a new version. Adding fields is non-breaking. Removing fields is breaking. Changing field types is breaking.
- Support at least two versions simultaneously. This gives consumers time to migrate.
- Deprecate old versions with a timeline. “Version 1 will be removed on date X.”
- Use consumer-driven contract tests to verify compatibility. See Contract Testing.
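One way the "support two versions" rule might look in practice, serving v1 and v2 side by side so consumers migrate on their own schedule (the routes and response shapes are illustrative, not a prescribed API):

```python
# Sketch: two versions of the same endpoint served side by side.
def get_order_v1(order_id: str) -> dict:
    # Original contract: "total" is a float in dollars.
    return {"id": order_id, "total": 42.5}


def get_order_v2(order_id: str) -> dict:
    # Breaking change (field renamed and retyped), so it ships as a new version.
    return {"id": order_id, "total_cents": 4250}


ROUTES = {
    "/api/v1/orders/{id}": get_order_v1,  # still served; deprecation date announced
    "/api/v2/orders/{id}": get_order_v2,  # current version
}
```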
Strategy 3: Prefer Asynchronous Communication
Synchronous calls (HTTP, gRPC) create temporal coupling: if the downstream service is slow or unavailable, the upstream service is also affected.
| Communication Style | Coupling | When to Use |
| --- | --- | --- |
| Synchronous (HTTP/gRPC) | Temporal + behavioral | When the caller needs an immediate response |
| Asynchronous (events/queues) | Behavioral only | When the caller does not need an immediate response |
| Event-driven (publish/subscribe) | Minimal | When the producer does not need to know about consumers |
Prefer asynchronous communication wherever the business requirements allow it. Not every interaction needs to be synchronous.
Strategy 4: Design for Failure
In a loosely coupled system, dependencies will sometimes be unavailable. Design for this (a circuit-breaker sketch follows this list):
- Circuit breakers: Stop calling a failing dependency after N failures. Return a degraded response instead.
- Timeouts: Set aggressive timeouts on all external calls. A 30-second timeout on a service that should respond in 100ms is not a timeout - it is a hang.
- Bulkheads: Isolate failures so that one failing dependency does not consume all resources.
- Graceful degradation: Define what the user experience should be when a dependency is down. “Recommendations unavailable” is better than a 500 error.
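A minimal circuit-breaker sketch under assumed thresholds (3 consecutive failures, a 30-second cool-down) showing how a failing dependency can be short-circuited into a degraded response:

```python
# Minimal circuit breaker: after N consecutive failures, stop calling the
# dependency for a cool-down period and return a degraded response instead.
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.consecutive_failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, fallback=None):
        # While the breaker is open, short-circuit and degrade gracefully.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                return fallback
            self.opened_at = None  # cool-down elapsed: try the dependency again

        try:
            result = fn(*args)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback
        self.consecutive_failures = 0
        return result


recommendations_breaker = CircuitBreaker()


def fetch_recommendations(user_id: str) -> list[str]:
    raise TimeoutError("recommendations service is down")  # simulate an outage


# "Recommendations unavailable" (an empty list) is better than a 500 error.
print(recommendations_breaker.call(fetch_recommendations, "user-1", fallback=[]))
```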
Practical Steps for Architecture Decoupling
Month 1: Map Dependencies
Before changing anything, understand what you have:
- Draw a dependency graph. Which components depend on which? Where are the shared databases?
- Identify deployment coupling. Which components must be deployed together? Why?
- Identify the highest-impact coupling. Which coupling most frequently blocks independent deployment?
Month 2-3: Establish the First Boundary
Pick one component to decouple. Choose the one with the highest impact and lowest risk:
- Apply the strangler fig pattern to extract it
- Define a clear API contract
- Move its data to its own data store
- Deploy it independently
Month 4+: Repeat
Take the next highest-impact coupling and address it. Each decoupling makes the next one easier because the team learns the patterns and the remaining system is simpler.
Key Pitfalls
1. “We need to rewrite everything before we can deploy independently”
No. Decoupling is incremental. Extract one component, deploy it independently, prove the pattern works, then continue. A partial decoupling that enables one team to deploy independently is infinitely more valuable than a planned rewrite that never finishes.
2. “We split into microservices but our lead time got worse”
Microservices add operational complexity (more services to deploy, monitor, and debug). If you split without investing in deployment automation, observability, and team autonomy, you will get worse, not better. Microservices are a tool for organizational scaling, not a silver bullet for delivery speed.
3. “Teams keep adding new dependencies that recouple the system”
Architecture decoupling requires governance. Establish architectural principles (e.g., “no shared databases”) and enforce them through automated checks (e.g., dependency analysis in CI) and architecture reviews for cross-boundary changes.
4. “We can’t afford the time to decouple”
You cannot afford not to. Every week spent doing coordinated releases is a week of delivery capacity lost to coordination overhead. The investment in decoupling pays for itself quickly through increased deployment frequency and reduced coordination cost.
Measuring Success
| Metric | Target | Why It Matters |
| --- | --- | --- |
| Teams that can deploy independently | Increasing | The primary measure of decoupling |
| Coordinated releases per quarter | Decreasing toward zero | Confirms coupling is being eliminated |
| Deployment frequency per team | Increasing independently | Confirms teams are not blocked by each other |
| Cross-team dependencies per feature | Decreasing | Confirms architecture supports independent work |
Next Step
With optimized flow, small batches, metrics-driven improvement, and a decoupled architecture, your team is ready for the final phase. Continue to Phase 4: Deliver on Demand.
3.5 - Phase 4: Deliver on Demand
The capability to deploy any change to production at any time, using the delivery strategy that fits your context.
Key question: “Can we deliver any change to production when the business needs it?”
This is the destination: you can deploy any change that passes the pipeline to production
whenever you choose. Some teams will auto-deploy every commit (continuous deployment). Others
will deploy on demand when the business is ready. Both are valid - the capability is what
matters, not the trigger.
What You’ll Do
- Deploy on demand - Remove the last manual gates so any green build can reach production
- Use progressive rollout - Canary, blue-green, and percentage-based deployments
- Explore agentic CD - AI-assisted continuous delivery patterns
- Learn from experience reports - How other teams made the journey
Continuous Delivery vs. Continuous Deployment
These terms are often confused. The distinction matters for this phase:
- Continuous delivery means every commit that passes the pipeline could be deployed to
production at any time. The capability exists. A human or business process decides when.
- Continuous deployment means every commit that passes the pipeline is deployed to
production automatically. No human decision is involved.
Continuous delivery is the goal of this migration guide. Continuous deployment is one delivery
strategy that works well for certain contexts - SaaS products, internal tools, services behind
feature flags. It is not a higher level of maturity. A team that deploys on demand with a
one-click deploy is just as capable as a team that auto-deploys every commit.
Why This Phase Matters
When your foundations are solid, your pipeline is reliable, and your batch sizes are small,
deploying any change becomes low-risk. The remaining barriers are organizational, not
technical: approval processes, change windows, release coordination. This phase addresses those
barriers so the team has the option to deploy whenever the business needs it.
Signs You’ve Arrived
- Any commit that passes the pipeline can reach production within minutes
- The team deploys frequently (daily or more) with no drama
- Mean time to recovery is measured in minutes
- The team has confidence that any deployment can be safely rolled back
- New team members can deploy on their first day
- The deployment strategy (on-demand or automatic) is a team choice, not a constraint
3.5.1 - Deploy on Demand
Remove the last manual gates and deploy every change that passes the pipeline.
Phase 4 - Deliver on Demand | Original content
Deploy on demand means that any change which passes the full automated pipeline can reach production without waiting for a human to press a button, open a ticket, or schedule a window. This page covers the prerequisites, the transition from continuous delivery to continuous deployment, and how to address the organizational concerns that are the real barriers.
Continuous Delivery vs. Continuous Deployment
These two terms are often confused. The distinction matters:
- Continuous Delivery: Every commit that passes the pipeline could be deployed to production. A human decides when to deploy.
- Continuous Deployment: Every commit that passes the pipeline is deployed to production. No human decision is required.
If you have completed Phases 1-3 of this migration, you have continuous delivery. This page is about removing that last manual decision and moving to continuous deployment.
Why Remove the Last Gate?
The manual deployment decision feels safe. It gives someone a chance to “eyeball” the change before it goes to production. In practice, it does the opposite.
The Problems with Manual Gates
| Problem | Why It Happens | Impact |
| --- | --- | --- |
| Batching | If deploys are manual, teams batch changes to reduce the number of deploy events | Larger batches increase risk and make rollback harder |
| Delay | Changes wait for someone to approve, which may take hours or days | Longer lead time, delayed feedback |
| False confidence | The approver cannot meaningfully review what the automated pipeline already tested | The gate provides the illusion of safety without actual safety |
| Bottleneck | One person or team becomes the deploy gatekeeper | Creates a single point of failure for the entire delivery flow |
| Deploy fear | Infrequent deploys mean each deploy is higher stakes | Teams become more cautious, batches get larger, risk increases |
The Paradox of Manual Safety
The more you rely on manual deployment gates, the less safe your deployments become. This is because manual gates lead to batching, batching increases risk, and increased risk justifies more manual gates. It is a vicious cycle.
Continuous deployment breaks this cycle. Small, frequent, automated deployments are individually low-risk. If one fails, the blast radius is small and recovery is fast.
Prerequisites for Deploy on Demand
Before removing manual gates, verify that these conditions are met. Each one is covered in earlier phases of this migration.
Non-Negotiable Prerequisites
| Prerequisite | What It Means | Where to Build It |
|--------------|---------------|-------------------|
| Comprehensive automated tests | The test suite catches real defects, not just trivial cases | Testing Fundamentals |
| Fast, reliable pipeline | The pipeline completes in under 15 minutes and rarely fails for non-code reasons | Deterministic Pipeline |
| Automated rollback | You can roll back a bad deployment in minutes without manual intervention | Rollback |
| Feature flags | Incomplete features are hidden from users via flags, not deployment timing | Feature Flags |
| Small batch sizes | Each deployment contains 1-3 small changes, not dozens | Small Batches |
| Production-like environments | Test environments match production closely enough that test results are trustworthy | Production-Like Environments |
| Observability | You can detect production issues within minutes through monitoring and alerting | Metrics-Driven Improvement |
Assessment: Are You Ready?
Answer these questions honestly:
- When was the last time your pipeline caught a real bug? If the answer is “I don’t remember,” your test suite may not be trustworthy enough.
- How long does a rollback take? If the answer is more than 15 minutes, automate it first.
- Do deploys ever fail for non-code reasons? (Environment issues, credential problems, network flakiness.) If yes, stabilize your pipeline first.
- Does the team trust the pipeline? If team members regularly say “let me check one more thing before we deploy,” trust is not there yet. Build it through retrospectives and transparent metrics.
The Transition: Three Approaches
Approach 1: Shadow Mode
Run continuous deployment alongside manual deployment. Every change that passes the pipeline is automatically deployed to a shadow production environment (or a canary group). A human still approves the “real” production deployment.
Duration: 2-4 weeks.
What you learn: How often the automated deployment would have been correct. If the answer is “every time” (or close to it), the manual gate is not adding value.
Transition: Once the team sees that the shadow deployments are consistently safe, remove the manual gate.
Approach 2: Opt-In per Team
Allow individual teams to adopt continuous deployment while others continue with manual gates. This works well in organizations with multiple teams at different maturity levels.
Duration: Ongoing. Teams opt in when they are ready.
What you learn: Which teams are ready and which need more foundation work. Early adopters demonstrate the pattern for the rest of the organization.
Transition: As more teams succeed, continuous deployment becomes the default. Remaining teams are supported in reaching readiness.
Approach 3: Direct Switchover
Remove the manual gate for all teams at once. This is appropriate when the organization has high confidence in its pipeline and all teams have completed Phases 1-3.
Duration: Immediate.
What you learn: Quickly reveals any hidden dependencies on the manual gate (e.g., deploy coordination between teams, configuration changes that ride along with deployments).
Transition: Be prepared to temporarily revert if unforeseen issues arise. Have a clear rollback plan for the process change itself.
Addressing Organizational Concerns
The technical prerequisites are usually met before the organizational ones. These are the conversations you will need to have.
“What about change management / ITIL?”
Change management frameworks like ITIL define a “standard change” category: a pre-approved, low-risk, well-understood change that does not require a Change Advisory Board (CAB) review. Continuous deployment changes qualify as standard changes because they are:
- Small (one to a few commits)
- Automated (same pipeline every time)
- Reversible (automated rollback)
- Well-tested (comprehensive automated tests)
Work with your change management team to classify pipeline-passing deployments as standard changes. This preserves the governance framework while removing the bottleneck.
“What about compliance and audit?”
Continuous deployment does not eliminate audit trails - it strengthens them. Every deployment is:
- Traceable: Tied to a specific commit, which is tied to a specific story or ticket
- Reproducible: The same pipeline produces the same result every time
- Recorded: Pipeline logs capture every test that passed, every approval that was automated
- Reversible: Rollback history shows when and why a deployment was reverted
Provide auditors with access to pipeline logs, deployment history, and the automated test suite. This is a more complete audit trail than a manual approval signature.
“What about database migrations?”
Database migrations require special care in continuous deployment because they cannot be rolled back as easily as code changes.
Rules for database migrations in CD:
- Migrations must be backward-compatible. The previous version of the code must work with the new schema.
- Use the expand/contract pattern. First deploy the new column/table (expand). Then deploy the code that uses it. Then remove the old column/table (contract). Each step is a separate deployment (a sketch follows this list).
- Never drop a column in the same deployment that stops using it. There is always a window where both old and new code run simultaneously.
- Test migrations in production-like environments before they reach production.
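To make the expand/contract rule concrete, here is a minimal sketch of the three-step sequence. The table and column names are hypothetical and the SQL is illustrative - adapt it to your own migration tool.

```python
# Hypothetical expand/contract sequence for replacing users.email with users.contact_email.
# Each numbered step ships as its own deployment; at every point the running code
# works with the schema that is live in production.

STEP_1_EXPAND = """
ALTER TABLE users ADD COLUMN contact_email TEXT;   -- new column, nullable
UPDATE users SET contact_email = email;            -- backfill existing rows
"""

# Step 2 is a code-only deployment: the application writes to both columns
# and reads from contact_email. No schema change ships with it.

STEP_3_CONTRACT = """
ALTER TABLE users DROP COLUMN email;               -- only after no deployed code reads it
"""
```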
“What if we deploy a breaking change?”
This is why you have automated rollback and observability. The sequence is:
- Deployment happens automatically
- Monitoring detects an issue (error rate spike, latency increase, health check failure)
- Automated rollback triggers (or on-call engineer triggers manual rollback)
- The team investigates and fixes the issue
- The fix goes through the pipeline and deploys automatically
The key insight: this sequence takes minutes with continuous deployment. With manual deployment on a weekly schedule, the same breaking change would take days to detect and fix.
After the Transition
What Changes for the Team
| Before | After |
|--------|-------|
| “Are we deploying today?” | Deploys happen automatically, all the time |
| “Who’s doing the deploy?” | Nobody - the pipeline does it |
| “Can I get this into the next release?” | Every merge to trunk is the next release |
| “We need to coordinate the deploy with team X” | Teams deploy independently |
| “Let’s wait for the deploy window” | There are no deploy windows |
What Stays the Same
- Code review still happens (before merge to trunk)
- Automated tests still run (in the pipeline)
- Feature flags still control feature visibility (decoupling deploy from release)
- Monitoring still catches issues (but now recovery is faster)
- The team still owns its deployments (but the manual step is gone)
The First Week
The first week of continuous deployment will feel uncomfortable. This is normal. The team will instinctively want to “check” deployments that happen automatically. Resist the urge to add manual checks back. Instead:
- Watch the monitoring dashboards more closely than usual
- Have the team discuss each automatic deployment in standup for the first week
- Celebrate the first deployment that goes out without anyone noticing - that is the goal
Key Pitfalls
1. “We adopted continuous deployment but kept the approval step ‘just in case’”
If the approval step exists, it will be used, and you have not actually adopted continuous deployment. Remove the gate completely. If something goes wrong, use rollback - do not use a pre-deployment gate.
2. “Our deploy cadence didn’t actually increase”
Continuous deployment only increases deploy frequency if the team is integrating to trunk frequently. If the team still merges weekly, they will deploy weekly - automatically, but still weekly. Revisit Trunk-Based Development and Small Batches.
3. “We have continuous deployment for the application but not the database/infrastructure”
Partial continuous deployment creates a split experience: application changes flow freely but infrastructure changes still require manual coordination. Extend the pipeline to cover infrastructure as code, database migrations, and configuration changes.
Measuring Success
| Metric | Target | Why It Matters |
|--------|--------|----------------|
| Deployment frequency | Multiple per day | Confirms the pipeline is deploying every change |
| Lead time | < 1 hour from commit to production | Confirms no manual gates are adding delay |
| Manual interventions per deploy | Zero | Confirms the process is fully automated |
| Change failure rate | Stable or improving | Confirms automation is not introducing new failures |
| MTTR | < 15 minutes | Confirms automated rollback is working |
Next Step
Continuous deployment deploys every change, but not every change needs to go to every user at once. Progressive Rollout strategies let you control who sees a change and how quickly it spreads.
3.5.2 - Progressive Rollout
Use canary, blue-green, and percentage-based deployments to reduce deployment risk.
Phase 4 - Deliver on Demand | Original content
Progressive rollout strategies let you deploy to production without deploying to all users simultaneously. By exposing changes to a small group first and expanding gradually, you catch problems before they affect your entire user base. This page covers the three major strategies, when to use each, and how to implement automated rollback.
Why Progressive Rollout?
Even with comprehensive tests, production-like environments, and small batch sizes, some issues only surface under real production traffic. Progressive rollout is the final safety layer: it limits the blast radius of any deployment by exposing the change to a small audience first.
This is not a replacement for testing. It is an addition. Your automated tests should catch the vast majority of issues. Progressive rollout catches the rest - the issues that depend on real user behavior, real data volumes, or real infrastructure conditions that cannot be fully replicated in test environments.
The Three Strategies
Strategy 1: Canary Deployment
A canary deployment routes a small percentage of production traffic to the new version while the majority continues to hit the old version. If the canary shows no problems, traffic is gradually shifted.
```
                   ┌─────────────────┐
              5%   │   New Version   │ ← Canary
           ┌──────►│      (v2)       │
           │       └─────────────────┘
Traffic ───┤
           │       ┌─────────────────┐
           └──────►│   Old Version   │ ← Stable
              95%  │      (v1)       │
                   └─────────────────┘
```
How it works:
- Deploy the new version alongside the old version
- Route 1-5% of traffic to the new version
- Compare key metrics (error rate, latency, business metrics) between canary and stable
- If metrics are healthy, increase traffic to 25%, 50%, 100%
- If metrics degrade, route all traffic back to the old version
When to use canary:
- Changes that affect request handling (API changes, performance optimizations)
- Changes where you want to compare metrics between old and new versions
- Services with high traffic volume (you need enough canary traffic for statistical significance)
When canary is not ideal:
- Changes that affect batch processing or background jobs (no “traffic” to route)
- Very low traffic services (the canary may not get enough traffic to detect issues)
- Database schema changes (both versions must work with the same schema)
Implementation options:
| Infrastructure | How to Route Traffic |
|----------------|----------------------|
| Kubernetes + service mesh (Istio, Linkerd) | Weighted routing rules in VirtualService |
| Load balancer (ALB, NGINX) | Weighted target groups |
| CDN (CloudFront, Fastly) | Origin routing rules |
| Application-level | Feature flag with percentage rollout |
Strategy 2: Blue-Green Deployment
Blue-green deployment maintains two identical production environments. At any time, one (blue) serves live traffic and the other (green) is idle or staging.
```
BEFORE:
  Traffic ──────► [Blue - v1]    (ACTIVE)
                  [Green]        (IDLE)

DEPLOY:
  Traffic ──────► [Blue - v1]    (ACTIVE)
                  [Green - v2]   (DEPLOYING / SMOKE TESTING)

SWITCH:
  Traffic ──────► [Green - v2]   (ACTIVE)
                  [Blue - v1]    (STANDBY / ROLLBACK TARGET)
```
How it works:
- Deploy the new version to the idle environment (green)
- Run smoke tests against green to verify basic functionality
- Switch the router/load balancer to point all traffic at green
- Keep blue running as an instant rollback target
- After a stability period, repurpose blue for the next deployment
When to use blue-green:
- You need instant, complete rollback (switch the router back)
- You want to test the deployment in a full production environment before routing traffic
- Your infrastructure supports running two parallel environments cost-effectively
When blue-green is not ideal:
- Stateful applications where both environments share mutable state
- Database migrations (the new version’s schema must work for both environments during transition)
- Cost-sensitive environments (maintaining two full production environments doubles infrastructure cost)
Rollback speed: Seconds. Switching the router back is the fastest rollback mechanism available.
Strategy 3: Percentage-Based Rollout
Percentage-based rollout gradually increases the number of users who see the new version. Unlike canary (which is traffic-based), percentage rollout is typically user-based - a specific user always sees the same version during the rollout period.
```
Hour 0:    1% of users → v2,   99% → v1
Hour 2:    5% of users → v2,   95% → v1
Hour 8:   25% of users → v2,   75% → v1
Day 2:    50% of users → v2,   50% → v1
Day 3:   100% of users → v2
```
How it works:
- Enable the new version for a small percentage of users (using feature flags or infrastructure routing)
- Monitor metrics for the affected group
- Gradually increase the percentage over hours or days
- At any point, reduce the percentage back to 0% if issues are detected
When to use percentage rollout:
- User-facing feature changes where you want consistent user experience (a user always sees v1 or v2, not a random mix)
- Changes that benefit from A/B testing data (compare user behavior between groups)
- Long-running rollouts where you want to collect business metrics before full exposure
When percentage rollout is not ideal:
- Backend infrastructure changes with no user-visible impact
- Changes that affect all users equally (e.g., API response format changes)
Implementation: Percentage rollout is typically implemented through Feature Flags (Level 2 or Level 3), using the user ID as the hash key to ensure consistent assignment.
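A minimal sketch of consistent percentage assignment, assuming a flag check that hashes the user ID; the function and flag names are illustrative, not taken from any specific feature-flag library.

```python
import hashlib

def in_rollout(user_id: str, flag_name: str, percentage: float) -> bool:
    """Deterministically assign a user to the rollout group.

    The same user_id always hashes to the same bucket, so a user who sees v2
    keeps seeing v2 as the percentage grows.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000    # bucket in 0..9999
    return bucket < percentage * 100         # e.g. 5.0 (%) -> buckets 0..499

# Usage: gate the new code path behind the flag.
if in_rollout(user_id="user-42", flag_name="new-checkout", percentage=5.0):
    pass  # serve v2
else:
    pass  # serve v1
```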
Choosing the Right Strategy
| Factor | Canary | Blue-Green | Percentage |
|--------|--------|------------|------------|
| Rollback speed | Seconds (reroute traffic) | Seconds (switch environments) | Seconds (disable flag) |
| Infrastructure cost | Low (runs alongside existing) | High (two full environments) | Low (same infrastructure) |
| Metric comparison | Strong (side-by-side comparison) | Weak (before/after only) | Strong (group comparison) |
| User consistency | No (each request may hit different version) | Yes (all users see same version) | Yes (each user sees consistent version) |
| Complexity | Moderate | Moderate | Low (if you have feature flags) |
| Best for | API changes, performance changes | Full environment validation | User-facing features |
Many teams use more than one strategy. A common pattern:
- Blue-green for infrastructure and platform changes
- Canary for service-level changes
- Percentage rollout for user-facing feature changes
Automated Rollback
Progressive rollout is only effective if rollback is automated. A human noticing a problem at 3 AM is not a reliable rollback mechanism.
Metrics to Monitor
Define automated rollback triggers before deploying. Common triggers:
| Metric | Trigger Condition | Example |
|--------|-------------------|---------|
| Error rate | Canary error rate > 2x stable error rate | Stable: 0.1%, Canary: 0.3% -> rollback |
| Latency (p99) | Canary p99 > 1.5x stable p99 | Stable: 200ms, Canary: 400ms -> rollback |
| Health check | Any health check failure | HTTP 500 on /health -> rollback |
| Business metric | Conversion rate drops > 5% for canary group | 10% conversion -> 4% conversion -> rollback |
| Saturation | CPU or memory exceeds threshold | CPU > 90% for 5 minutes -> rollback |
Automated Rollback Flow
```
Deploy new version
        │
        ▼
Route 5% of traffic to new version
        │
        ▼
Monitor for 15 minutes
        │
        ├── Metrics healthy ──────► Increase to 25%
        │                                │
        │                                ▼
        │                       Monitor for 30 minutes
        │                                │
        │                                ├── Metrics healthy ──────► Increase to 100%
        │                                │
        │                                └── Metrics degraded ─────► ROLLBACK
        │
        └── Metrics degraded ─────► ROLLBACK
```
Tools that can automate this analysis and rollback include:

| Tool | How It Helps |
|------|--------------|
| Argo Rollouts | Kubernetes-native progressive delivery with automated analysis and rollback |
| Flagger | Progressive delivery operator for Kubernetes with Istio, Linkerd, or App Mesh |
| Spinnaker | Multi-cloud deployment platform with canary analysis |
| Custom scripts | Query your metrics system, compare thresholds, trigger rollback via API |
The specific tool matters less than the principle: define rollback criteria before deploying, monitor automatically, and roll back without human intervention.
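As a sketch of the "custom scripts" row: fetch canary and stable metrics, apply the thresholds from the trigger table above, and decide. The metric-fetching and rollback calls are hypothetical placeholders for your own monitoring and deployment APIs.

```python
def should_rollback(stable_error_rate: float, canary_error_rate: float,
                    stable_p99_ms: float, canary_p99_ms: float) -> bool:
    """Apply the example trigger conditions from the table above."""
    if canary_error_rate > 2 * stable_error_rate:
        return True
    if canary_p99_ms > 1.5 * stable_p99_ms:
        return True
    return False

# Hypothetical wiring - replace with calls to your metrics and deployment systems:
# if should_rollback(fetch("error_rate", "stable"), fetch("error_rate", "canary"),
#                    fetch("latency_p99", "stable"), fetch("latency_p99", "canary")):
#     trigger_rollback()
```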
Implementing Progressive Rollout
Step 1: Choose Your First Strategy (Week 1)
Pick the strategy that matches your infrastructure:
- If you already have feature flags: start with percentage-based rollout
- If you have Kubernetes with a service mesh: start with canary
- If you have parallel environments: start with blue-green
Step 2: Define Rollback Criteria (Week 1)
Before your first progressive deployment:
- Identify the 3-5 metrics that define “healthy” for your service
- Define numerical thresholds for each metric
- Define the monitoring window (how long to wait before advancing)
- Document the rollback procedure (even if automated, document it for human understanding)
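One way to capture these decisions is as data that the later automation (Step 4) can read; a minimal sketch with illustrative metric names and thresholds:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RollbackCriterion:
    metric: str            # name of the metric in your monitoring system
    comparison: str        # "canary_vs_stable" ratio or "absolute" threshold
    threshold: float       # e.g. 2.0 means the canary may be at most 2x stable
    window_minutes: int    # how long to observe before advancing

# Illustrative criteria for a hypothetical service - tune the numbers to your SLOs.
ROLLBACK_CRITERIA = [
    RollbackCriterion("http_error_rate", "canary_vs_stable", 2.0, 15),
    RollbackCriterion("latency_p99_ms", "canary_vs_stable", 1.5, 15),
    RollbackCriterion("cpu_utilization_pct", "absolute", 90.0, 5),
]
```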
Step 3: Run a Manual Progressive Rollout (Week 2-3)
Before automating, run the process manually:
- Deploy to a canary or small percentage
- A team member monitors the dashboard for the defined window
- The team member decides to advance or roll back
- Document what they checked and how they decided
This manual practice builds understanding of what the automation will do.
Step 4: Automate the Rollout (Week 4+)
Replace the manual monitoring with automated checks:
- Implement metric queries that check your rollback criteria
- Implement automated traffic shifting (advance or rollback based on metrics)
- Implement alerting so the team knows when a rollback occurs
- Test the automation by intentionally deploying a known-bad change (in a controlled way)
Key Pitfalls
1. “Our canary doesn’t get enough traffic for meaningful metrics”
If your service handles 100 requests per hour, a 5% canary gets 5 requests per hour - not enough to detect problems statistically. Solutions: use a higher canary percentage (25-50%), use longer monitoring windows, or use blue-green instead (which does not require traffic splitting).
2. “We have progressive rollout but rollback is still manual”
Progressive rollout without automated rollback is half a solution. If the canary shows problems at 2 AM and nobody is watching, the damage occurs before anyone responds. Automated rollback is the essential companion to progressive rollout.
3. “We treat progressive rollout as a replacement for testing”
Progressive rollout is the last line of defense, not the first. If you are regularly catching bugs in canary that your test suite should have caught, your test suite needs improvement. Progressive rollout should catch rare, production-specific issues - not common bugs.
4. “Our rollout takes days because we’re too cautious”
A rollout that takes a week negates the benefits of continuous deployment. If your confidence in the pipeline is low enough to require a week-long rollout, the issue is pipeline quality, not rollout speed. Address the root cause through better testing and more production-like environments.
Measuring Success
| Metric | Target | Why It Matters |
|--------|--------|----------------|
| Automated rollbacks per month | Low and stable | Confirms the pipeline catches most issues before production |
| Time from deploy to full rollout | Hours, not days | Confirms the team has confidence in the process |
| Incidents caught by progressive rollout | Tracked (any number) | Confirms the progressive rollout is providing value |
| Manual interventions during rollout | Zero | Confirms the process is fully automated |
Next Step
With deploy on demand and progressive rollout, your technical deployment infrastructure is complete. Agentic CD explores how AI-assisted patterns can extend these practices further.
3.5.3 - Agentic CD
Extend continuous deployment with constraints and practices for AI agent-generated changes.
Phase 4 - Deliver on Demand | Adapted from MinimumCD.org
As AI coding agents become capable of generating production-ready code changes, the continuous deployment pipeline must evolve to handle agent-generated work with the same rigor applied to human-generated work - and in some cases, more rigor. Agentic CD defines the additional constraints and artifacts needed when agents contribute to the delivery pipeline.
What Is Agentic CD?
Agentic CD extends the Minimum CD framework to address a new category of contributor: AI agents that can generate, test, and propose code changes. These agents may operate autonomously (generating changes without human prompting) or collaboratively (assisting a human developer).
The core principle is simple: an agent-generated change must meet or exceed the same quality bar as a human-generated change. The pipeline does not care who wrote the code. It cares whether the code is correct, tested, and safe to deploy.
But agents introduce unique challenges that require additional constraints:
- Agents can generate changes faster than humans can review them
- Agents may lack context about organizational norms, business rules, or unstated constraints
- Agents cannot currently exercise judgment about risk in the same way humans can
- Agents may introduce subtle correctness issues that pass automated tests but violate intent
The Six First-Class Artifacts
Agentic CD defines six artifacts that must be explicitly maintained in a delivery pipeline that includes AI agents. These artifacts exist in human-driven CD as well, but they are often implicit. When agents are involved, they must be explicit.
1. Intent Description
What it is: A human-readable description of the desired change, written by a human.
Why it matters for agentic CD: The intent description is the agent’s “prompt” in the broadest sense. It defines what the change should accomplish, not how. Without a clear intent description, the agent may generate technically correct code that does not match what was needed.
Example:
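An illustrative intent description - hypothetical, not drawn from the source material - might read:

```
Allow customers to export their order history as a CSV file for a date
range they choose. The export must respect the same visibility rules as
the order history page and must never include another customer's data.
```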
Key property: The intent description is authored by a human. It is the human’s specification of what the agent should achieve. The agent does not write or modify the intent description.
2. User-Facing Behavior
What it is: A description of how the system should behave from the user’s perspective, expressed as observable outcomes.
Why it matters for agentic CD: Agents can generate code that satisfies tests but does not produce the expected user experience. User-facing behavior descriptions bridge the gap between technical correctness and user value.
Format: BDD scenarios work well here (see Small Batches):
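A hypothetical scenario for the export example above:

```
Scenario: Customer exports recent orders
  Given a signed-in customer with 12 orders in the last 30 days
  When they request a CSV export for the last 30 days
  Then the downloaded file contains exactly those 12 orders
  And no orders belonging to other customers appear in the file
```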
3. Feature Description
What it is: A technical description of the feature’s architecture, dependencies, and integration points.
Why it matters for agentic CD: Agents need explicit architectural context that human developers often carry in their heads. The feature description tells the agent where the change fits in the system, what components it touches, and what constraints apply.
Example:
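An illustrative feature description for the same hypothetical change:

```
Feature: Order history CSV export
Touches: orders-api (new /orders/export endpoint), web frontend (export button)
Depends on: existing OrderRepository; no new external services
Constraints: exports are generated synchronously for up to 1,000 orders;
             larger exports are out of scope for this change
```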
4. Executable Truth
What it is: Automated tests that define the correct behavior of the system. These tests are the authoritative source of truth for what the code should do.
Why it matters for agentic CD: For human developers, tests verify the code. For agent-generated code, tests also constrain the agent. If the tests are comprehensive, the agent cannot generate incorrect code that passes. If the tests are shallow, the agent can generate code that passes tests but does not satisfy the intent.
Key principle: Executable truth must be written or reviewed by a human before the agent generates the implementation. This inverts the common practice of writing tests after code. In agentic CD, the tests come first because they are the specification.
5. Implementation
What it is: The actual code that implements the feature. In agentic CD, this may be generated entirely by the agent, co-authored by agent and human, or authored by a human with agent assistance.
Why it matters for agentic CD: The implementation is the artifact most likely to be agent-generated. The key requirement is that it must satisfy the executable truth (tests), conform to the feature description (architecture), and achieve the intent description (purpose).
Review requirements: Agent-generated implementation must be reviewed by a human before merging to trunk. The review focuses on:
- Does the implementation match the intent? (Not just “does it pass tests?”)
- Does it follow the architectural constraints in the feature description?
- Does it introduce unnecessary complexity, dependencies, or security risks?
- Would a human developer on the team understand and maintain this code?
6. System Constraints
What it is: Non-functional requirements, security policies, performance budgets, and organizational rules that apply to all changes.
Why it matters for agentic CD: Human developers internalize system constraints through experience and team norms. Agents need these constraints stated explicitly.
Examples:
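Illustrative system constraints (hypothetical values, not organizational policy from the source):

- All endpoints require authentication; no new public routes without a security review
- p99 latency budget for user-facing requests: 300 ms
- No new runtime dependencies without a license and vulnerability check
- Personally identifiable information is never written to application logs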
The Agentic CD Workflow
When an AI agent contributes to a CD pipeline, the workflow extends the standard CD pipeline:
1. HUMAN writes Intent Description
2. HUMAN writes or reviews User-Facing Behavior (BDD scenarios)
3. HUMAN writes or reviews Feature Description (architecture)
4. HUMAN writes or reviews Executable Truth (tests)
5. AGENT generates Implementation (code)
6. PIPELINE validates Implementation against Executable Truth (automated tests)
7. HUMAN reviews Implementation (code review)
8. PIPELINE deploys (same pipeline as any other change)
Key differences from standard CD:
- Steps 1-4 happen before the agent generates code (test-first is mandatory, not optional)
- Step 7 (human review) is mandatory for agent-generated code
- System constraints are checked automatically in the pipeline (Step 6)
Constraints for Agent-Generated Changes
Beyond the six artifacts, agentic CD imposes additional constraints on agent-generated changes:
Change Size Limits
Agent-generated changes must be small. Large agent-generated changes are harder to review and more likely to contain subtle issues.
Guideline: An agent-generated change should modify no more files and no more lines than a human would in a single commit. If the change is larger, break it into multiple sequential changes.
Mandatory Human Review
Every agent-generated change must be reviewed by a human before merging to trunk. This is a non-negotiable constraint. The purpose is not to check the agent’s “work” in a supervisory sense - it is to verify that the change matches the intent and fits the system.
Comprehensive Test Coverage
Agent-generated code must have higher test coverage than the team’s baseline. If the team’s baseline is 80% coverage, agent-generated code should target 90%+. This compensates for the reduced human oversight of the implementation details.
Provenance Tracking
The pipeline must record which changes were agent-generated, which agent generated them, and what prompt or intent description was used. This supports audit, debugging, and learning.
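A minimal sketch of what a provenance record could look like, stored alongside the pipeline run or emitted as structured log output; the field names are illustrative, not from any specific tool.

```python
import json
from datetime import datetime, timezone

# Illustrative provenance record for one agent-generated change.
provenance = {
    "commit": "abc1234",                    # the change being tracked
    "generated_by": "agent",                # "human", "agent", or "pair"
    "agent_name": "example-coding-agent",   # hypothetical agent identifier
    "agent_version": "2026-01",
    "intent_ref": "TICKET-123",             # link to the intent description
    "human_reviewer": "j.doe",
    "recorded_at": datetime.now(timezone.utc).isoformat(),
}

print(json.dumps(provenance, indent=2))
```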
Getting Started with Agentic CD
Before jumping into agentic workflows, ensure your team has the prerequisite delivery practices
in place. The AI Adoption Roadmap provides a
step-by-step sequence: quality tools, clear requirements, hardened guardrails, and reduced
delivery friction - all before accelerating with AI coding.
Phase 1: Agent as Assistant
The agent helps human developers write code, but the human makes all decisions and commits all changes. The pipeline does not know or care about agent involvement.
This is where most teams are today. It requires no pipeline changes.
Phase 2: Agent as Contributor
The agent generates complete changes based on intent descriptions and executable truth. A human reviews and merges. The pipeline validates.
Requires: Explicit intent descriptions, test-first workflow, human review gate.
Phase 3: Agent as Autonomous Contributor
The agent generates, tests, and proposes changes with minimal human involvement. Human review is still mandatory, but the agent handles the full cycle from intent to implementation.
Requires: All six first-class artifacts, comprehensive system constraints, provenance tracking, and high confidence in the executable truth.
Key Pitfalls
1. “We let the agent generate tests and code together”
If the agent writes both the tests and the code, the tests may be designed to pass the code rather than to verify the intent. Tests must be written or reviewed by a human before the agent generates the implementation. This is the most important constraint in agentic CD.
2. “The agent generates changes faster than we can review them”
This is a feature, not a bug - but only if you have the discipline to not merge unreviewed changes. The agent’s speed should not pressure humans to review faster. WIP limits apply: if the review queue is full, the agent stops generating new changes.
3. “We trust the agent because it passed the tests”
Passing tests is necessary but not sufficient. Tests cannot verify intent, architectural fitness, or maintainability. Human review remains mandatory.
4. “We don’t track which changes are agent-generated”
Without provenance tracking, you cannot learn from agent-generated failures, audit agent behavior, or improve the agent’s constraints over time. Track provenance from the start.
Measuring Success
| Metric | Target | Why It Matters |
|--------|--------|----------------|
| Agent-generated change failure rate | Equal to or lower than human-generated | Confirms agent changes meet the same quality bar |
| Review time for agent-generated changes | Comparable to human-generated changes | Confirms changes are reviewable, not rubber-stamped |
| Test coverage for agent-generated code | Higher than baseline | Confirms the additional coverage constraint is met |
| Agent-generated changes with complete artifacts | 100% | Confirms the six-artifact workflow is being followed |
Next Step
For real-world examples of teams that have made the full journey to continuous deployment, see Experience Reports.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
3.5.4 - Experience Reports
Real-world stories from teams that have made the journey to continuous deployment.
Phase 4 - Deliver on Demand | Adapted from MinimumCD.org
Theory is necessary but insufficient. This page collects experience reports from organizations that have adopted continuous deployment at scale, including the challenges they faced, the approaches they took, and the results they achieved. These reports demonstrate that CD is not limited to startups or greenfield projects - it works in large, complex, regulated environments.
Why Experience Reports Matter
Every team considering continuous deployment faces the same objection: “That works for [Google / Netflix / small startups], but our situation is different.” Experience reports counter this objection with evidence. They show that organizations of every size, in every industry, with every kind of legacy system, have found a path to continuous deployment.
No experience report will match your situation exactly. That is not the point. The point is to extract patterns: what obstacles did these teams encounter, and how did they overcome them?
Walmart: CD at Retail Scale
Context
Walmart operates one of the world’s largest e-commerce platforms alongside its massive physical retail infrastructure. Changes to the platform affect millions of transactions per day. The organization had a traditional release process with weekly deployment windows and multi-stage manual approval.
The Challenge
- Scale: Thousands of developers across hundreds of teams
- Risk tolerance: Any outage affects revenue in real time
- Legacy: Decades of existing systems with deep interdependencies
- Regulation: PCI compliance requirements for payment processing
What They Did
- Invested in a centralized deployment platform (OneOps, later Concord) that standardized the deployment pipeline across all teams
- Broke the monolithic release into independent service deployments
- Implemented automated canary analysis for every deployment
- Moved from weekly release trains to on-demand deployment per team
Key Lessons
- Platform investment pays off. Building a shared deployment platform let hundreds of teams adopt CD without each team solving the same infrastructure problems.
- Compliance and CD are compatible. Automated pipelines with full audit trails satisfied PCI requirements more reliably than manual approval processes.
- Cultural change is harder than technical change. Teams that had operated on weekly release cycles for years needed coaching and support to trust automated deployment.
Microsoft: From Waterfall to Daily Deploys
Context
Microsoft’s Azure DevOps (formerly Visual Studio Team Services) team made a widely documented transformation from 3-year waterfall releases to deploying multiple times per day. This transformation happened within one of the largest software organizations in the world.
The Challenge
- History: Decades of waterfall development culture
- Product complexity: A platform used by millions of developers
- Organizational size: Thousands of engineers across multiple time zones
- Customer expectations: Enterprise customers expected stability and predictability
What They Did
- Broke the product into independently deployable services (ring-based deployment)
- Implemented a ring-based rollout: Ring 0 (team), Ring 1 (internal Microsoft users), Ring 2 (select external users), Ring 3 (all users)
- Invested heavily in automated testing, achieving thousands of tests running in minutes
- Moved from a fixed release cadence to continuous deployment with feature flags controlling release
- Used telemetry to detect issues in real time and automatically rolled back when metrics degraded
Key Lessons
- Ring-based deployment is progressive rollout. Microsoft’s ring model is an implementation of the progressive rollout strategies described in this guide.
- Feature flags enabled decoupling. By deploying frequently but releasing features incrementally via flags, the team could deploy without worrying about feature completeness.
- The transformation took years, not months. Moving from 3-year cycles to daily deployment was a multi-year journey with incremental progress at each step.
Google: Engineering Productivity at Scale
Context
Google is often cited as the canonical example of continuous deployment, deploying changes to production thousands of times per day across its vast service portfolio.
The Challenge
- Scale: Billions of users, millions of servers
- Monorepo: Most of Google operates from a single repository with billions of lines of code
- Interdependencies: Changes in shared libraries can affect thousands of services
- Velocity: Thousands of engineers committing changes every day
What They Did
- Built a culture of automated testing where tests are a first-class deliverable, not an afterthought
- Implemented a submit queue that runs automated tests on every change before it merges to the trunk
- Invested in build infrastructure (Blaze/Bazel) that can build and test only the affected portions of the codebase
- Used percentage-based rollout for user-facing changes
- Made rollback a one-click operation available to every team
Key Lessons
- Test infrastructure is critical infrastructure. Google’s ability to deploy frequently depends entirely on its ability to test quickly and reliably.
- Monorepo and CD are compatible. The common assumption that CD requires microservices with separate repos is false. Google deploys from a monorepo.
- Invest in tooling before process. Google built the tooling (build systems, test infrastructure, deployment automation) that made good practices the path of least resistance.
Amazon: Two-Pizza Teams and Ownership
Context
Amazon’s transformation to service-oriented architecture and team ownership is one of the most influential in the industry. The “two-pizza team” model and “you build it, you run it” philosophy directly enabled continuous deployment.
The Challenge
- Organizational size: Hundreds of thousands of employees
- System complexity: Thousands of services powering amazon.com and AWS
- Availability requirements: Even brief outages are front-page news
- Pace of innovation: Competitive pressure demands rapid feature delivery
What They Did
- Decomposed the system into independently deployable services, each owned by a small team
- Gave teams full ownership: build, test, deploy, operate, and support
- Built internal deployment tooling (Apollo) that automates canary analysis, rollback, and one-click deployment
- Established the practice of deploying every commit that passes the pipeline, with automated rollback on metric degradation
Key Lessons
- Ownership drives quality. When the team that writes the code also operates it in production, they write better code and build better monitoring.
- Small teams move faster. Two-pizza teams (6-10 people) can make decisions without bureaucratic overhead.
- Automation eliminates toil. Amazon’s internal deployment tooling means that deploying requires no specialist skill - any team member can deploy (and the pipeline usually deploys automatically).
HP: CD in Hardware-Adjacent Software
Context
HP’s LaserJet firmware team demonstrated that continuous delivery principles apply even to embedded software, a domain often considered incompatible with frequent deployment.
The Challenge
- Embedded software: Firmware that runs on physical printers
- Long development cycles: Firmware releases had traditionally been annual
- Quality requirements: Firmware bugs require physical recalls or complex update procedures
- Team size: Large, distributed teams with varying skill levels
What They Did
- Invested in automated testing infrastructure for firmware
- Reduced build times from days to under an hour
- Moved from annual releases to frequent incremental updates
- Implemented continuous integration with automated test suites running on simulator and hardware
Key Lessons
- CD principles are universal. Even embedded firmware can benefit from small batches, automated testing, and continuous integration.
- Build time is a critical constraint. Reducing build time from days to under an hour unlocked the ability to test frequently, which enabled frequent integration, which enabled frequent delivery.
- Results were dramatic: Development costs reduced by approximately 40%, programs delivered on schedule increased by roughly 140%.
Flickr: “10+ Deploys Per Day”
Context
Flickr’s 2009 presentation “10+ Deploys Per Day: Dev and Ops Cooperation” is credited with helping launch the DevOps movement. At a time when most organizations deployed quarterly, Flickr was deploying more than ten times per day.
The Challenge
- Web-scale service: Serving billions of photos to millions of users
- Ops/Dev divide: Traditional separation between development and operations teams
- Fear of change: Deployments were infrequent because they were risky
What They Did
- Built automated infrastructure provisioning and deployment
- Implemented feature flags to decouple deployment from release
- Created a culture of shared responsibility between development and operations
- Made deployment a routine, low-ceremony event that anyone could trigger
- Used IRC bots (and later chat-based tools) to coordinate and log deployments
Key Lessons
- Culture is the enabler. Flickr’s technical practices were important, but the cultural shift - developers and operations working together, shared responsibility, mutual respect - was what made frequent deployment possible.
- Tooling should reduce friction. Flickr’s deployment tools were designed to make deploying as easy as possible. The easier it is to deploy, the more often people deploy, and the smaller each deployment becomes.
- Transparency builds trust. Logging every deployment in a shared channel let everyone see what was deploying, who deployed it, and whether it caused problems. This transparency built organizational trust in frequent deployment.
Common Patterns Across Reports
Despite the diversity of these organizations, several patterns emerge consistently:
1. Investment in Automation Precedes Cultural Change
Every organization built the tooling first. Automated testing, automated deployment, automated rollback - these created the conditions where frequent deployment was possible. Cultural change followed when people saw that the automation worked.
2. Incremental Adoption, Not Big Bang
No organization switched to continuous deployment overnight. They all moved incrementally: shorter release cycles first, then weekly deploys, then daily, then on-demand. Each step built confidence for the next.
3. Team Ownership Is Essential
Organizations that gave teams ownership of their deployments (build it, run it) moved faster than those that kept deployment as a centralized function. Ownership creates accountability, which drives quality.
4. Feature Flags Are Universal
Every organization in these reports uses feature flags to decouple deployment from release. This is not optional for continuous deployment - it is foundational.
5. The Results Are Consistent
Regardless of industry, size, or starting point, organizations that adopt continuous deployment consistently report:
- Higher deployment frequency (daily or more)
- Lower change failure rate (small changes fail less)
- Faster recovery (automated rollback, small blast radius)
- Higher developer satisfaction (less toil, more impact)
- Better business outcomes (faster time to market, reduced costs)
Applying These Lessons to Your Migration
You do not need to be Google-sized to benefit from these patterns. Extract what applies:
- Start with automation. Build the pipeline, the tests, the rollback mechanism.
- Adopt incrementally. Move from monthly to weekly to daily. Do not try to jump to 10 deploys per day on day one.
- Give teams ownership. Let teams deploy their own services.
- Use feature flags. Decouple deployment from release.
- Measure and improve. Track DORA metrics. Run experiments. Use retrospectives.
These are the practices covered throughout this migration guide. The experience reports confirm that they work - not in theory, but in production, at scale, in the real world.
Further Reading
For detailed experience reports and additional case studies, see:
- MinimumCD.org Experience Reports - Collected reports from organizations practicing minimum CD
- Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim - The research behind DORA metrics, with extensive case study data
- Continuous Delivery by Jez Humble and David Farley - The foundational text, with detailed examples from multiple organizations
- The DevOps Handbook by Gene Kim, Jez Humble, Patrick Debois, and John Willis - Case studies from organizations across industries
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
4 - CD for Greenfield Projects
Starting a new project? Build continuous delivery in from day one instead of retrofitting it later.
Starting with CD is dramatically easier than migrating to it. When there is no legacy process,
no existing test suite to fix, and no entrenched habits to change, you can build the right
practices from the first commit. This section shows you how.
Why Start with CD
Teams that build CD into a new project from the beginning avoid the most painful parts of the
migration journey. There is no test suite to rewrite, no branching strategy to unwind, no
deployment process to automate after the fact. Every practice described in this guide can be
adopted on day one when there is no existing codebase to constrain you.
The cost of adopting CD practices in a greenfield project is near zero. The cost of retrofitting
them into a mature codebase can be months of work. The earlier you start, the less it costs.
What to Build from Day One
Pipeline first
Before writing application code, set up your delivery pipeline. The pipeline is feature zero.
Your first commit should include:
- A build script that compiles, tests, and packages the application
- A CI configuration that runs on every push to trunk
- A deployment mechanism (even if the first “deployment” is to a local environment)
- Every validation you know you will need from the start
The validations you put in the pipeline on day one define the quality standard for the
application. They are not overhead you add later - they are the mold that shapes every line of
code that follows. If you add linting after 10,000 lines of code, you are fixing 10,000 lines of
code. If you add it before the first line, every line is written to the standard.
Feature zero validations:
- Code style and formatting - Enforce a formatter (Prettier, Black, gofmt) so style is
never a code review conversation. The pipeline rejects code that is not formatted.
- Linting - Static analysis rules for your language (ESLint, pylint, golangci-lint). Catches
bugs, enforces idioms, and prevents anti-patterns before review.
- Type checking - If your language supports static type checking (TypeScript, Python with mypy, Java), enable strict mode from the start. Relaxing later is easy. Tightening later is painful.
- Test framework - The test runner is configured and a first test exists, even if it only
asserts that the application starts. The team should never have to set up testing
infrastructure - it is already there.
- Security scanning - Dependency vulnerability scanning (Dependabot, Snyk, Trivy) and basic
SAST rules. Security findings block the build from day one, so the team never accumulates a
backlog of vulnerabilities.
- Commit message or PR conventions - If you enforce conventional commits, changelog
generation, or PR title formats, add the check now.
Every one of these is trivial to add to an empty project and expensive to retrofit into a mature
codebase. The pipeline enforces them automatically, so the team never has to argue about them in
review. The conversation shifts from “should we fix this?” to “the pipeline already enforces
this.”
The pipeline should exist before the first feature. Every feature you build will flow through it
and meet every standard you defined on day one.
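A minimal sketch of a feature-zero check script that the pipeline could run on every push. The tools shown (ruff, mypy, pytest, pip-audit) are one possible Python stack - an assumption for illustration, not a prescription; substitute the equivalents for your language.

```python
#!/usr/bin/env python3
"""Run every feature-zero validation and fail the build on the first problem."""
import subprocess
import sys

CHECKS = [
    ["ruff", "format", "--check", "."],   # formatting is never a review conversation
    ["ruff", "check", "."],               # linting
    ["mypy", "--strict", "src"],          # type checking, strict from day one
    ["pytest", "-q"],                     # the first test exists before the first feature
    ["pip-audit"],                        # dependency vulnerability scanning
]

for cmd in CHECKS:
    print(f"==> {' '.join(cmd)}")
    if subprocess.run(cmd).returncode != 0:
        sys.exit(1)
```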
Deploy “hello world” to production
Your first deployment should happen before your first feature. Deploy the simplest possible
application - a health check endpoint, a static page, a “hello world” - all the way to
production through your pipeline. This is the single most important validation you can do early
because it proves the entire path works: build, test, package, deploy, verify.
Why production, not staging: The goal is to prove the full path works end-to-end. If you
deploy only to a staging environment, you have proven that the pipeline works up to staging. You
have not proven that production credentials, network routes, DNS, load balancers, permissions,
and deployment targets are correctly configured. Every gap between your test environment and
production is an assumption that will be tested for the first time under pressure, when it
matters most.
Deploy “hello world” to production on day one, and you will discover:
- Whether the team has the access and permissions to deploy
- Whether the infrastructure provisioning actually works
- Whether the deployment mechanism handles a real production environment
- Whether monitoring and health checks are wired up correctly
- Whether rollback works before you need it in an emergency
All of these are problems you want to find with a “hello world,” not with a real feature under
a deadline.
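A minimal post-deploy smoke check, assuming the "hello world" exposes a /health endpoint and the pipeline supplies the production URL; both are assumptions for illustration.

```python
#!/usr/bin/env python3
"""Fail the pipeline if the freshly deployed 'hello world' is not healthy."""
import sys
import urllib.request

# Placeholder URL - the pipeline would inject the real production address.
HEALTH_URL = "https://example.com/health"

try:
    with urllib.request.urlopen(HEALTH_URL, timeout=5) as response:
        if response.status != 200:
            sys.exit(f"Health check returned HTTP {response.status}")
except OSError as exc:
    sys.exit(f"Health check failed: {exc}")

print("Deployment verified: /health returned HTTP 200")
```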
Warning: deploying only to lower environments
If organizational constraints prevent you from deploying to production immediately, deploy as
close to production as you can. But be explicit about what this means: every environment that
is not production is an approximation. Lower environments may differ in network topology,
security policies, resource capacity, data volume, and third-party integrations. Each difference
is a gap in your confidence.
Track these gaps. Document every known difference between your deployment target and production.
Treat closing each gap as a priority, because until you have deployed to production through your
pipeline, you have not fully validated the path. The longer you wait, the more assumptions
accumulate, and the riskier the first real production deployment becomes.
Trunk-based development from the start
There is no reason to start with long-lived branches. From commit one:
- All work happens on trunk (or short-lived branches that merge to trunk within a day)
- The pipeline runs on every integration to trunk
- Trunk is always in a deployable state
See Trunk-Based Development for the practices.
Test architecture from the start
Design your test architecture before you have tests to migrate. Establish:
- Unit tests for all business logic
- Integration tests for every external boundary (databases, APIs, message queues)
- Functional tests that exercise your service in isolation with test doubles for dependencies
- Contract tests for every external dependency
- A clear rule: everything that blocks deployment is deterministic
See Testing Fundamentals for the full test architecture.
Small, vertical slices from the start
Decompose the first features into small, independently deployable increments. Establish the habit
of delivering thin vertical slices before the team has a chance to develop a batch mindset.
See Work Decomposition for slicing techniques.
Greenfield Checklist
Use this checklist to verify your new project is set up for CD from the start.
Week 1
Month 1
Month 3
Common Mistakes in Greenfield Projects
| Mistake | Why it happens | What to do instead |
|---------|----------------|--------------------|
| “We’ll add tests later” | Pressure to show progress on features | Write the first test before the first feature. TDD from day one. |
| “We’ll set up the pipeline later” | Pipeline feels like overhead when there’s little code | The pipeline is the first thing you build. Features flow through it. |
| Starting with feature branches | Habit from previous projects | Trunk-based development from commit one. No reason to start with branches. |
| Designing for scale before you have users | Over-engineering from the start | Build the simplest thing that works. Deploy frequently. Evolve the architecture based on real feedback. |
| Skipping contract tests because “we own both services” | Feels redundant when one team owns everything | You will not own everything forever. Contract tests are cheap to add early and expensive to add later. |
Related Content
5 - Defect Sources
A catalog of defect causes across the delivery value stream with detection methods, AI enhancement opportunities, and systemic fixes.
Adapted from AI Patterns: Defect Detection
Defects do not appear randomly. They originate from specific, predictable sources in the delivery
value stream. This reference catalogs those sources so teams can shift detection left, automate
where possible, and apply AI to accelerate the feedback loop.
Product & Discovery
These defects originate before a single line of code is written. They are the most expensive to
fix because they compound through every downstream phase.
| Defect Cause | Detection Method | AI Enhancement | Fix |
|--------------|------------------|----------------|-----|
| Building the wrong thing | Adoption dashboards, user research validation | Synthesize user feedback, support tickets, and usage data to surface misalignment earlier than production metrics | Validated user research before backlog entry; dual-track agile |
| Solving a problem nobody has | Problem validation stage gate, user interviews | Analyze support tickets and feature requests to identify real vs. assumed pain points | Problem validation as a stage gate; publish problem brief before solution |
| Correct problem, wrong solution | Prototype testing, A/B experiments | Compare proposed solution against prior approaches in similar domains | Prototype multiple approaches; measurable success criteria first |
| Meets spec but misses user intent | User acceptance testing, session recordings | Review acceptance criteria against user behavior data to flag misalignment | Acceptance criteria focused on user outcomes, not checklists |
| Over-engineering beyond need | Code complexity metrics, architecture review | Flag unnecessary abstraction layers and unused extension points | YAGNI principle; justify every abstraction layer |
| Prioritizing wrong work | Outcome tracking, opportunity scoring | Automated WSJF scoring using historical outcome data | WSJF prioritization with outcome data |
Integration & Boundaries
Defects at system boundaries are invisible to unit tests and often survive until production.
Contract testing and deliberate boundary design are the primary defenses.
| Defect Cause | Detection Method | AI Enhancement | Fix |
|--------------|------------------|----------------|-----|
| Interface mismatches | Contract tests (Pact, OpenAPI, buf) | Compare API schemas across versions to detect breaking changes before merge | Mandatory contract tests per boundary; API-first with generated clients |
| Wrong assumptions about upstream/downstream | Integration tests, behavioral contract documentation | Analyze call patterns across services to detect undocumented behavioral expectations | Document behavioral contracts; defensive coding at boundaries |
| Race conditions | Thread sanitizers, concurrency testing | Static analysis for concurrent access patterns; suggest idempotent alternatives | Idempotent design; queues over shared mutable state |
| Inconsistent distributed state | Distributed tracing (Jaeger, Zipkin), chaos engineering | Anomaly detection across distributed state to flag synchronization failures | Deliberate consistency model choices; saga with compensation logic |
Knowledge & Communication
These defects emerge from gaps between what people know and what the code expresses.
They are the hardest to detect with automated tools and the easiest to prevent with team practices.
| Defect Cause | Detection Method | AI Enhancement | Fix |
|--------------|------------------|----------------|-----|
| Implicit domain knowledge not in code | Knowledge concentration metrics, code review | Generate documentation from code and tests; flag where docs have drifted from implementation | Domain-Driven Design with ubiquitous language; embed rules in code |
| Ambiguous requirements | Three Amigos sessions, example mapping | Review requirements for ambiguity, missing edge cases, and contradictions; generate test scenarios | Three Amigos before work; example mapping; executable specs |
| Tribal knowledge loss | Bus factor analysis, documentation coverage | Identify knowledge silos by analyzing commit patterns and code ownership concentration | Pair/mob programming as default; rotate on-call; living docs |
| Divergent mental models across teams | Cross-team reviews, shared domain models | Compare terminology and domain models across codebases to detect semantic mismatches | Shared domain models; explicit bounded contexts |
Change & Complexity
These defects are caused by the act of changing existing code. The larger the change and the
longer it lives outside trunk, the higher the risk.
| Defect Cause | Detection Method | AI Enhancement | Fix |
|--------------|------------------|----------------|-----|
| Unintended side effects | Mutation testing (Stryker, PIT), regression suites | Automated blast radius analysis from change diffs; flag high-risk modifications | Small focused commits; trunk-based development; feature flags |
| Accumulated technical debt | Code complexity trends (CodeScene), static analysis | Track complexity trends and predict which modules are approaching failure thresholds | Refactoring as part of every story; dedicated debt budget |
| Unanticipated feature interactions | Feature flag testing, canary deployments | Analyze feature flag combinations to predict interaction conflicts | Feature flags with controlled rollout; modular design; canary deployments |
| Configuration drift | Infrastructure as code validation, environment diffing | Detect configuration differences across environments automatically | Infrastructure as code; immutable infrastructure; GitOps |
Testing & Observability Gaps
These defects survive because the safety net has holes. The fix is not more testing - it is
better-targeted testing and observability that closes the specific gaps.
| Defect Cause | Detection Method | AI Enhancement | Fix |
|---|---|---|---|
| Untested edge cases and error paths | Property-based testing (Hypothesis, fast-check), boundary analysis | Generate edge case test scenarios from code analysis; identify untested paths | Property-based testing as standard; boundary value analysis |
| Missing contract tests at boundaries | Boundary inventory audit, integration failure analysis | Scan service boundaries and flag missing contract test coverage | Mandatory contract tests per new boundary |
| Insufficient monitoring | SLO tracking, incident post-mortems | Analyze production incidents to recommend missing monitoring and alerting | Observability as non-functional requirement; SLOs for every user-facing path |
| Test environments don’t reflect production | Environment parity checks, deployment failure analysis | Compare environment configurations to flag meaningful differences | Production-like data in staging; test in production with flags |
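Property-based testing appears in this table as both a detection method and a fix. Here is a minimal sketch using fast-check (one of the tools listed above) against a hypothetical clamp() function: instead of asserting on a handful of hand-picked inputs, the test states a property that must hold for every generated input, which is how untested edge cases get found.
```typescript
// Property-based test sketch using fast-check. clamp() is a hypothetical
// example function, not something defined elsewhere in this guide.
import { test } from "node:test";
import fc from "fast-check";

function clamp(value: number, min: number, max: number): number {
  return Math.min(Math.max(value, min), max);
}

test("clamp always returns a value inside the range", () => {
  fc.assert(
    fc.property(fc.integer(), fc.integer(), fc.integer(), (value, a, b) => {
      const [min, max] = a <= b ? [a, b] : [b, a];
      const result = clamp(value, min, max);
      // The property holds for every generated input, covering edge cases
      // (min === max, negative ranges) a hand-written example table would miss.
      return result >= min && result <= max;
    }),
  );
});
```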
Process & Deployment
These defects are caused by the delivery process itself. Manual steps, large batches, and
slow feedback loops create the conditions for failure.
| Defect Cause | Detection Method | AI Enhancement | Fix |
|---|---|---|---|
| Long-lived branches | Branch age metrics, merge conflict frequency | Flag branches exceeding age thresholds; predict merge conflict probability | Trunk-based development; merge at least daily |
| Manual pipeline steps | Value stream mapping, deployment audit | Identify manual steps in the pipeline that can be automated | Automate every step commit-to-production |
| Batching too many changes per release | Deployment frequency metrics, change failure correlation | Correlate batch size with failure rates to quantify the cost of large batches | Continuous delivery; every commit is a candidate |
| Inadequate rollback capability | Rollback testing, incident recovery time | Automated risk scoring from change diff and deployment history | Blue/green or canary deployments; auto-rollback on health failure |
| Reliance on human review to catch preventable defects | Defect origin analysis, review effectiveness metrics | Identify defects caught in review that could be caught by automated tools | Reserve human review for knowledge transfer and design decisions |
| Manual review of risks and compliance (CAB) | Change lead time analysis, CAB effectiveness metrics | Automated change risk scoring to replace subjective risk assessment | Replace CAB with automated progressive delivery |
Data & State
Data defects are particularly dangerous because they can corrupt persistent state. Unlike code
defects, data corruption often cannot be fixed by deploying a new version.
| Defect Cause | Detection Method | AI Enhancement | Fix |
|---|---|---|---|
| Schema migration and backward compatibility failures | Migration testing, schema version tracking | Analyze schema changes for backward compatibility violations before merge | Expand-then-contract schema migrations; never breaking changes |
| Null or missing data assumptions | Null safety analysis (NullAway, TypeScript strict), property testing | Static analysis for null safety; flag unhandled optional values | Null-safe type systems; Option/Maybe as default; validate at boundaries |
| Concurrency and ordering issues | Distributed tracing, idempotency testing | Detect patterns vulnerable to out-of-order delivery | Design for out-of-order delivery; idempotent consumers |
| Cache invalidation errors | Cache hit/miss monitoring, stale data detection | Analyze cache invalidation patterns and flag potential staleness windows | Short TTLs; event-driven invalidation |
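The "idempotent consumers" fix for concurrency and ordering issues is easier to see in code than in prose. Below is a minimal, in-memory sketch of a consumer that tolerates duplicate and out-of-order delivery; the event shape is an assumption, and a real consumer would persist the de-duplication and version state in a durable store alongside the business data.
```typescript
// Idempotent, out-of-order-tolerant consumer (illustrative sketch).
// The message shape and the in-memory stores are assumptions.
interface PaymentEvent {
  eventId: string;   // unique per event, used for de-duplication
  accountId: string;
  version: number;   // monotonically increasing per account
  balance: number;
}

const processedEvents = new Set<string>();
const latestVersion = new Map<string, number>();
const balances = new Map<string, number>();

function handle(event: PaymentEvent): void {
  // Duplicate delivery: applying the same event twice must not change state.
  if (processedEvents.has(event.eventId)) return;

  // Out-of-order delivery: ignore events older than what is already applied.
  const seen = latestVersion.get(event.accountId) ?? 0;
  if (event.version <= seen) {
    processedEvents.add(event.eventId);
    return;
  }

  balances.set(event.accountId, event.balance);
  latestVersion.set(event.accountId, event.version);
  processedEvents.add(event.eventId);
}
```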
Dependency & Infrastructure
These defects originate outside your codebase but break your system. The fix is to treat
external dependencies as untrusted boundaries.
| Defect Cause | Detection Method | AI Enhancement | Fix |
|---|---|---|---|
| Third-party library breaking changes | Dependency scanning (Dependabot, Renovate), automated upgrade PRs | Analyze changelog and API diffs to predict breaking impact before upgrade | Pin dependencies; automated upgrade PRs with test gates |
| Infrastructure differences across environments | Infrastructure as code validation, environment parity checks | Compare infrastructure definitions across environments to flag drift | Single source of truth for all environments; containerization |
| Network partitions and partial failures handled wrong | Chaos engineering (Gremlin, Litmus), failure injection testing | Analyze error handling code for missing failure modes | Circuit breakers; retries; bulkheads as defaults; test failure modes explicitly |
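The table lists circuit breakers as a default fix for partial failures. Here is a minimal sketch of the pattern, with hypothetical thresholds and a made-up inventory endpoint; production systems would typically reach for an existing implementation (for example, the opossum library in Node.js) rather than hand-rolling one.
```typescript
// Minimal circuit breaker around an external call (illustrative sketch).
type State = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: State = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5,   // assumed threshold
    private readonly resetAfterMs = 30_000,  // assumed cool-down
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.resetAfterMs) {
        // Fail fast instead of piling requests onto a struggling dependency.
        throw new Error("circuit open: dependency unavailable");
      }
      this.state = "half-open"; // allow one trial request through
    }
    try {
      const result = await fn();
      this.state = "closed";
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}

// Usage: wrap every call across the untrusted boundary.
const inventoryBreaker = new CircuitBreaker();
const getStock = (sku: string) =>
  inventoryBreaker.call(() => fetch(`http://inventory.internal/stock/${sku}`));
```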
From Reactive to Proactive
Systemic Thinking
The traditional approach to defects is reactive: wait for a bug, find it, fix it. The catalog
above enables a proactive approach: understand where defects originate, detect them at the
earliest possible point, and fix the systemic cause rather than the individual symptom.
AI enhances this shift by processing signals (code changes, test results, production metrics,
user feedback) faster and across more dimensions than manual analysis allows. But AI does not
replace the systemic fixes. Automated detection without process change just finds defects faster
without preventing them.
The goal is not zero defects. The goal is defects caught at the cheapest point in the value
stream, with systemic fixes that prevent the same category of defect from recurring.
Related Content
This content is adapted from AI Patterns: Defect Detection,
licensed under CC BY 4.0.
6 - AI Adoption Roadmap
A prescriptive guide for incorporating AI into your delivery process safely - remove friction and add safety before accelerating with AI coding.
Adapted from Incorporating AI Without Crashing
AI adoption is chaos testing for your organization. It does not create new problems - it reveals
existing ones. Teams that try to accelerate with AI before fixing their delivery process get the
same result as putting a bigger engine in a car with no way to steer or stop: you go faster,
briefly, and then something expensive happens. This page provides the prescriptive sequence for
incorporating AI safely, mirroring the brownfield migration phases.
The Key Insight
AI amplifies whatever system it is applied to. If your delivery process has strong guardrails,
fast feedback, and clear requirements, AI makes you faster. If your process has unclear
requirements, manual gates, fragile tests, and slow pipelines, AI makes those problems worse -
and it makes them worse faster.
The sequence matters: remove friction and add safety before you accelerate.
The Progression
graph LR
A["1. Quality Tools"] --> B["2. Clarify Work"]
B --> C["3. Harden Guardrails"]
C --> D["4. Remove Friction"]
D --> E["5. Accelerate with AI"]
style A fill:#e8f4fd,stroke:#1a73e8
style B fill:#e8f4fd,stroke:#1a73e8
style C fill:#fce8e6,stroke:#d93025
style D fill:#fce8e6,stroke:#d93025
style E fill:#e6f4ea,stroke:#137333
Each step builds on the previous one. Skipping steps means AI amplifies your weaknesses
instead of your strengths.
Step 1: Choose Quality Tools
Brownfield phase: Assess
Before using AI for anything, choose models and tools that minimize hallucination and rework.
Not all AI tools are equal. A model that generates plausible-looking but incorrect code creates
more work than it saves.
What to do:
- Evaluate AI coding tools on accuracy, not speed. A tool that generates correct code 80% of
the time and incorrect code 20% of the time has a hidden rework tax on every use.
- Use models with strong reasoning capabilities for code generation. Smaller, faster models are
appropriate for autocomplete and suggestions, not for generating business logic.
- Establish a baseline: measure how much rework AI-generated code requires before and after
changing tools. If rework exceeds 20% of generated output, the tool is a net negative.
What this enables: A foundation of AI tooling that generates correct output more often than
not, so subsequent steps build on working code rather than compensating for broken code.
Step 2: Clarify Work Before Coding
Brownfield phase: Assess / Foundations
Use AI to improve requirements before code is written, not to write code from vague requirements.
Ambiguous requirements are the single largest source of defects
(see Defect Sources), and AI can detect ambiguity faster than
manual review.
What to do:
- Use AI to review tickets, user stories, and acceptance criteria before development begins.
Prompt it to identify gaps, contradictions, untestable statements, and missing edge cases.
- Use AI to generate test scenarios from requirements. If the AI cannot generate clear test
cases, the requirements are not clear enough for a human either.
- Use AI to analyze support tickets and incident reports for patterns that should inform
the backlog.
What this enables: Higher-quality inputs to the development process. Developers (human or AI)
start with clear, testable specifications rather than ambiguous descriptions that produce
ambiguous code.
Step 3: Harden Guardrails
Brownfield phase: Foundations / Pipeline
Before accelerating code generation, strengthen the safety net that catches mistakes. This means
both product guardrails (does the code work?) and development guardrails (is the code
maintainable?).
Product and operational guardrails:
- Automated test suites with meaningful coverage of critical paths
- Deterministic CI/CD pipelines that run on every commit
- Deployment validation (smoke tests, health checks, canary analysis)
Development guardrails:
- Code style enforcement (linters, formatters) that runs automatically
- Architecture rules (dependency constraints, module boundaries) enforced in the pipeline (a minimal sketch appears at the end of this step)
- Security scanning (SAST, dependency vulnerability checks) on every commit
What to do:
- Audit your current guardrails. For each one, ask: “If AI generated code that violated this,
would our pipeline catch it?” If the answer is no, fix the guardrail before expanding AI use.
- Add contract tests at service boundaries. AI-generated code is
particularly prone to breaking implicit contracts between services.
- Ensure test suites run in minutes, not hours. Slow tests create pressure to skip them, which
is dangerous when code is generated faster.
What this enables: A safety net that catches mistakes regardless of who (or what) made them.
The pipeline becomes the authority on code quality, not human reviewers.
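As an illustration of a development guardrail the pipeline can enforce (the architecture-rules item above), here is a minimal sketch of a layering check, assuming a hypothetical rule that code under src/domain/ must never import from src/infrastructure/. Purpose-built tools such as dependency-cruiser or ESLint import rules do this more robustly; the point is that the rule runs in the pipeline and fails the build regardless of who wrote the code.
```typescript
// Architecture rule as an automated pipeline check (illustrative sketch).
// The src/domain and src/infrastructure layout is an assumption.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

function sourceFiles(dir: string): string[] {
  return readdirSync(dir, { withFileTypes: true }).flatMap((entry) => {
    const path = join(dir, entry.name);
    if (entry.isDirectory()) return sourceFiles(path);
    return path.endsWith(".ts") ? [path] : [];
  });
}

// Flag any domain file that imports from the infrastructure layer.
const violations = sourceFiles("src/domain").filter((file) =>
  /from\s+["'].*infrastructure/.test(readFileSync(file, "utf8")),
);

if (violations.length > 0) {
  console.error("Domain code must not depend on infrastructure:", violations);
  process.exit(1); // fail the pipeline, regardless of who wrote the code
}
```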
Step 4: Reduce Delivery Friction
Brownfield phase: Pipeline / Optimize
Remove the manual steps, slow processes, and fragile environments that limit how fast you can
safely deliver. These bottlenecks exist in every brownfield system, and they become acute when AI accelerates the code generation phase.
What to do:
- Remove manual approval gates that add wait time without adding safety
(see Replacing Manual Validations).
- Fix fragile test and staging environments that cause intermittent failures.
- Shorten branch lifetimes. If branches live longer than a day, integration pain will increase
as AI accelerates code generation.
- Automate deployment. If deploying requires a runbook or a specific person, it is a bottleneck
that will be exposed when code moves faster.
What this enables: A delivery pipeline where the time from “code complete” to “running in
production” is measured in minutes, not days. When this path is fast and reliable, AI-generated
code flows through the same pipeline as human-generated code with the same safety guarantees.
Step 5: Accelerate with AI Coding
Brownfield phase: Optimize / Continuous Deployment
Now - and only now - expand AI use to code generation, refactoring, and autonomous contributions.
The guardrails are in place. The pipeline is fast. Requirements are clear. The outcome of every
change is deterministic regardless of whether a human or an AI wrote it.
What to do:
- Use AI for code generation with the test-first workflow described in
Agentic CD. Write tests first, then let AI generate the implementation (a minimal sketch follows at the end of this step).
- Use AI for refactoring: extracting interfaces, reducing complexity, improving test coverage.
These are high-value, low-risk tasks where AI excels.
- Use AI to analyze incidents and suggest fixes, with the same pipeline validation applied to
any change.
What this enables: AI-accelerated development where the speed increase translates to faster
delivery, not faster defect generation. The pipeline enforces the same quality bar regardless of
the author.
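A minimal sketch of the test-first loop described above, using a hypothetical applyDiscount() function: the test is written first from the acceptance criteria, and whatever implementation follows (human- or AI-generated) only merges if the test and the rest of the pipeline pass.
```typescript
// Test-first workflow sketch. applyDiscount() is a hypothetical function
// used only for illustration.
import { test } from "node:test";
import assert from "node:assert/strict";

// 1. Written first, by a human, from the acceptance criteria.
test("orders over 100 get a 10% discount, others are unchanged", () => {
  assert.equal(applyDiscount(150), 135);
  assert.equal(applyDiscount(100), 100);
  assert.equal(applyDiscount(0), 0);
});

// 2. Generated second (by a human or an AI assistant); it merges only if
//    the test above and the rest of the pipeline pass.
function applyDiscount(total: number): number {
  return total > 100 ? total * 0.9 : total;
}
```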
Mapping to Brownfield Phases
| AI Adoption Step | Brownfield Phase | Key Connection |
|---|---|---|
| 1. Quality Tools | Assess | Evaluate tooling as part of current-state assessment |
| 2. Clarify Work | Assess / Foundations | Better requirements feed better work decomposition |
| 3. Harden Guardrails | Foundations / Pipeline | Same testing and pipeline work, with AI-readiness as additional motivation |
| 4. Remove Friction | Pipeline / Optimize | Same automation and flow optimization, unblocking AI-speed delivery |
| 5. Accelerate with AI | Optimize / CD | AI coding becomes safe when the pipeline is deterministic and fast |
The Destination: Agentic CD
The end state of this roadmap is a delivery pipeline where AI agents can contribute code with the
same safety guarantees as human developers. This is
Agentic CD - the extension of
continuous deployment to handle agent-generated changes. You do not need to be at CD maturity to
start the AI adoption roadmap, but the roadmap leads there.
Related Content
This content is adapted from
Incorporating AI Without Crashing
by Bryan Finster.
7 - FAQ
Frequently asked questions about continuous delivery and this migration guide.
Adapted from MinimumCD.org
About This Guide
Why does this migration guide exist?
Many teams say they want to adopt continuous delivery but do not know where to start. The CD
landscape is full of tools, frameworks, and advice, but there is no clear, sequenced path from
“we deploy monthly” to “we can deploy any change at any time.” This guide provides that path.
It is built on the MinimumCD definition of continuous delivery and
draws on practices from the Dojo Consortium and the
DORA research. The content is organized as a migration – a phased journey
from your current state to continuous delivery – rather than as a description of what CD looks
like when you are already there.
Who is this guide for?
This guide is for development teams, tech leads, and engineering managers who want to improve
their software delivery practices. It is designed for teams that are currently deploying
infrequently (monthly, quarterly, or less) and want to reach a state where any change can be
deployed to production at any time.
You do not need to be starting from zero. If your team already has CI in place, you can begin
with Phase 2 – Pipeline. If you have a pipeline but deploy infrequently, start
with Phase 3 – Optimize. Use the Phase 0 assessment to find your
starting point.
Should we adopt this guide as an organization or as a team?
Start with a single team. CD adoption works best when a team can experiment, learn, and iterate
without waiting for organizational consensus. Once one team demonstrates results – shorter lead
times, lower change failure rate, more frequent deployments – other teams will have a concrete
example to follow.
Organizational adoption comes after team adoption, not before. The role of organizational
leadership is to create the conditions for teams to succeed: stable team composition, tool
funding, policy flexibility for deployment processes, and protection from pressure to cut
corners on quality.
How do we use this guide for improvement?
Start with Phase 0 – Assess. Map your value stream, measure your current
performance, and identify your top constraints. Then work through the phases in order, focusing
on one constraint at a time.
The guide is not a checklist to complete in sequence. It is a reference that helps you decide
what to work on next. Some teams will spend months in Phase 1 building testing fundamentals.
Others will move quickly to Phase 2 because they already have strong development practices.
Your value stream map and metrics tell you where to invest.
Revisit your assessment periodically. As you improve, new constraints will emerge. The phases
give you a framework for addressing them.
Continuous Delivery Concepts
What is the difference between continuous delivery and continuous deployment?
Continuous delivery means every change to the codebase is always in a deployable state and
can be released to production at any time through a fully automated pipeline. The decision to
deploy may still be made by a human, but the capability to deploy is always present.
Continuous deployment is an extension of continuous delivery where every change that passes
the automated pipeline is deployed to production without manual intervention.
This migration guide takes you through continuous delivery (Phases 0-3) and then to continuous
deployment (Phase 4). Continuous delivery is the prerequisite. You cannot safely automate
deployment decisions until your pipeline reliably determines what is deployable.
Is continuous delivery the same as having a CI/CD pipeline?
No. Many teams have a CI/CD pipeline tool (Jenkins, GitHub Actions, GitLab CI, etc.) but are
not practicing continuous delivery. A pipeline tool is necessary but not sufficient.
Continuous delivery requires:
- Trunk-based development – all developers integrating to trunk at least daily
- Comprehensive test automation – fast, reliable tests that catch real defects
- A single path to production – every change goes through the same automated pipeline
- Immutable artifacts – build once, deploy the same artifact everywhere
- The ability to deploy any green build – not just special “release” builds
If your team has a pipeline but uses long-lived feature branches, deploys only at the end of a
sprint, or requires manual testing before a release, you have a pipeline tool but you are not
practicing continuous delivery. The current-state checklist
in Phase 0 helps you assess the gap.
What does “the pipeline is the only path to production” mean?
It means there is exactly one way for any change to reach production: through the automated
pipeline. No one can SSH into a server and make a change. No one can skip the test suite for
an “urgent” fix. No one can deploy from their local machine.
This constraint is what gives you confidence. If every change in production has been through
the same build, test, and deployment process, you know what is running and how it got there.
If exceptions are allowed, you lose that guarantee, and your ability to reason about production
state degrades.
During your migration, establishing this single path is a key milestone in
Phase 2.
What does “application configuration” mean in the context of CD?
Application configuration refers to values that change between environments but are not part of
the application code: database connection strings, API endpoints, feature flag states, logging
levels, and similar settings.
In a CD pipeline, configuration is externalized – it lives outside the artifact and is injected
at deployment time. This is what makes immutable artifacts
possible. You build the artifact once and deploy it to any environment by providing the
appropriate configuration.
If configuration is embedded in the artifact (for example, hardcoded URLs or environment-specific
config files baked into a container image), you must rebuild the artifact for each environment,
which means the artifact you tested is not the artifact you deploy. This breaks the immutability
guarantee. See Application Config.
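A minimal sketch of what externalized configuration looks like in code, assuming environment variables as the injection mechanism and made-up variable names: the artifact contains none of these values, and the same build reads different values in each environment.
```typescript
// Externalized configuration sketch. Values are injected at deployment time;
// the variable names are examples, not a prescribed convention.
function required(name: string): string {
  const value = process.env[name];
  if (!value) {
    throw new Error(`Missing required configuration: ${name}`);
  }
  return value;
}

export const config = {
  databaseUrl: required("DATABASE_URL"),
  paymentsApiBase: required("PAYMENTS_API_BASE"),
  logLevel: process.env.LOG_LEVEL ?? "info",
};

// The same immutable artifact runs in every environment; only these values differ.
```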
What is an “immutable artifact” and why does it matter?
An immutable artifact is a build output (container image, binary, package) that is never
modified after it is created. The exact artifact that passes your test suite is the exact
artifact that is deployed to staging, and then to production. Nothing is recompiled, repackaged,
or patched between environments.
This matters because it eliminates an entire category of deployment failures: “it worked in
staging but not in production” caused by differences in the build. If the same bytes are
deployed everywhere, build-related discrepancies are impossible.
Immutability requires externalizing configuration (see above) and storing artifacts in a
registry or repository. See Immutable Artifacts.
What does “deployable” mean?
A change is deployable when it has passed all automated quality gates defined in the pipeline.
The definition is codified in the pipeline itself, not decided by a person at deployment time.
A typical deployable definition includes:
- All unit tests pass
- All integration tests pass
- All functional tests pass
- Static analysis checks pass (linting, security scanning)
- The artifact is built and stored in the artifact registry
- Deployment to a production-like environment succeeds
- Smoke tests in the production-like environment pass
If any of these gates fail, the change is not deployable. The pipeline makes this determination
automatically and consistently. See Deployable Definition.
What is the difference between deployment and release?
Deployment is the act of putting code into a production environment.
Release is the act of making functionality available to users.
These are different events, and decoupling them is one of the most powerful techniques in CD.
You can deploy code to production without releasing it to users by using
feature flags. The code is running in production, but the new
functionality is disabled. When you are ready, you enable the flag and the feature is released.
This decoupling is important because it separates the technical risk (will the deployment
succeed?) from the business risk (will users like the feature?). You can manage each risk
independently. Deployments become routine technical events. Releases become deliberate business
decisions.
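A minimal sketch of that decoupling, with a hypothetical flag name and an in-memory flag store standing in for a real flag service: the new checkout code is deployed and running in production, but users only see it once the flag is flipped.
```typescript
// Feature flag sketch: the code is deployed, the feature is released only when
// the flag is enabled. The flag name and in-memory store are assumptions; real
// systems typically use a flag service with per-user or percentage targeting.
const flags = new Map<string, boolean>([["new-checkout", false]]);

function isEnabled(flag: string): boolean {
  return flags.get(flag) ?? false;
}

function newCheckoutFlow(items: string[]): string {
  return `new checkout: ${items.length} items`; // deployed, dark until released
}

function legacyCheckoutFlow(items: string[]): string {
  return `legacy checkout: ${items.length} items`; // current behaviour for everyone
}

export function checkout(items: string[]): string {
  return isEnabled("new-checkout")
    ? newCheckoutFlow(items)
    : legacyCheckoutFlow(items);
}
```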
Migration Questions
How long does the migration take?
It depends on where you start and how much organizational support you have. As a rough guide:
- Phase 0 (Assess): 1-2 weeks
- Phase 1 (Foundations): 1-6 months, depending on current testing and TBD maturity
- Phase 2 (Pipeline): 1-3 months
- Phase 3 (Optimize): 2-6 months
- Phase 4 (Deliver on Demand): 1-3 months
These ranges assume a single team working on the migration alongside regular delivery work.
The biggest variable is Phase 1: teams with no test automation or TBD practice will spend
longer building foundations than teams that already have these in place.
Do not treat these timelines as commitments. The migration is an iterative improvement process,
not a project with a deadline.
Do we stop delivering features during the migration?
No. The migration is done alongside regular delivery work, not instead of it. Each migration
practice is adopted incrementally: you do not stop the world to rewrite your test suite or
redesign your pipeline.
For example, in Phase 1 you adopt trunk-based development by reducing branch lifetimes
gradually – from two weeks to one week to two days to same-day. You add automated tests
incrementally, starting with the highest-risk code paths. You decompose work into smaller
stories one sprint at a time.
The migration practices themselves improve your delivery speed, so the investment pays off
as you go. Teams that have completed Phase 1 typically report delivering features faster than
before, not slower.
What if our organization requires manual change approval (CAB)?
Many organizations have Change Advisory Board (CAB) processes that require manual approval
before production deployments. This is one of the most common organizational blockers for CD.
The path forward is to replace the manual approval with automated evidence. A CAB exists
because the organization lacks confidence that changes are safe. Your CD pipeline, when mature,
provides stronger evidence of safety than a committee meeting:
- Every change has passed comprehensive automated tests
- The exact artifact that was tested is the one being deployed
- Rollback is automated and takes minutes
- Deployment is a routine event that happens many times per week
Use your DORA metrics to demonstrate that automated pipelines produce lower change failure
rates than manual approval processes. Most CAB processes were designed for a world of monthly
releases with hundreds of changes per batch. When you deploy daily with one or two changes per
deployment, the risk profile is fundamentally different.
This is a gradual conversation, not a one-time negotiation. Start by inviting CAB
representatives to observe your pipeline. Show them the test results, the deployment logs,
the rollback capability. Build trust through evidence.
What if we have a monolithic architecture?
You can practice continuous delivery with a monolith. CD does not require microservices. Many
of the highest-performing teams in the DORA research deploy monolithic applications multiple
times per day.
What matters is that your architecture supports independent testing and deployment. A
well-structured monolith with a comprehensive test suite and a reliable pipeline can achieve
CD. A poorly structured collection of microservices with shared databases and coordinated
releases cannot.
Architecture decoupling is addressed in Phase 3, but
it is about enabling independent deployment and reducing coordination costs, not about adopting
any particular architectural style.
What if our tests are slow or unreliable?
This is one of the most common starting conditions. A slow or flaky test suite undermines
every CD practice: developers stop trusting the tests, broken builds are ignored, and the
pipeline becomes a bottleneck rather than an enabler.
The solution is incremental, not wholesale:
- Delete or quarantine flaky tests. A test that sometimes passes and sometimes fails
provides no signal. Remove it from the pipeline and fix it or replace it.
- Parallelize what you can. Many test suites are slow because they run sequentially.
Parallelization is often the fastest way to reduce pipeline duration.
- Rebalance the test pyramid. If most of your automated tests are end-to-end or UI
tests, they will be slow and brittle. Invest in unit and integration tests that run in
milliseconds and reserve end-to-end tests for critical paths only.
- Set a time budget. Your full pipeline – build, test, deploy to a staging environment
– should complete in under 10 minutes. If it takes longer, that is a constraint to address.
See Testing Fundamentals and the
Testing reference section for detailed guidance.
Where do I start if I am not sure which phase applies to us?
Start with Phase 0 – Assess. Complete the
value stream mapping exercise, take
baseline metrics, and fill out the
current-state checklist. These activities will tell you
exactly where you stand and which phase to begin with.
If you do not have time for a full assessment, ask yourself these questions:
- Do all developers integrate to trunk at least daily? If no, start with Phase 1.
- Do you have a single automated pipeline that every change goes through? If no, start with Phase 2.
- Can you deploy any green build to production on demand? If no, focus on the gap between your current state and Phase 2 completion criteria.
- Do you deploy at least weekly? If no, look at Phase 3 for batch size and flow optimization.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
8 - Reference
Supporting material: glossary, metrics definitions, testing guides, and additional resources.
This section provides reference material that supports your migration journey.
Use it alongside the phase guides for detailed definitions, metrics, and patterns.
Contents
- Glossary - Key terms and definitions
- CD Dependency Tree - How CD practices depend on each other
- Common Blockers - Frequently encountered obstacles and how to address them
- Defect Sources - Defect causes across the delivery value stream with detection methods and AI enhancements
- DORA Capabilities - The capabilities that drive software delivery performance
- Resources - Books, videos, and further reading
- Metrics - Detailed definitions for key delivery metrics
- Testing - Testing types, patterns, and best practices
8.1 - Glossary
Key terms and definitions used throughout this guide.
This glossary defines the terms used across every phase of the CD migration guide. Where a term
has a specific meaning within a migration phase, the relevant phase is noted.
A
Artifact
A packaged, versioned output of a build process (e.g., a container image, JAR file, or binary).
In a CD pipeline, artifacts are built once and promoted through environments without
modification. See Immutable Artifacts.
B
Baseline Metrics
The set of delivery measurements taken before beginning a migration, used as the benchmark
against which improvement is tracked. See Phase 0 – Baseline Metrics.
Batch Size
The amount of change included in a single deployment. Smaller batches reduce risk, simplify
debugging, and shorten feedback loops. Reducing batch size is a core focus of
Phase 3 – Small Batches.
BDD (Behavior-Driven Development)
A collaboration practice where developers, testers, and product representatives define expected
behavior using structured examples before code is written. BDD produces executable
specifications that serve as both documentation and automated tests. BDD supports effective
work decomposition by forcing clarity about what a
story actually means before development begins.
Blue-Green Deployment
A deployment strategy that maintains two identical production environments. New code is deployed
to the inactive environment, verified, and then traffic is switched. See
Progressive Rollout.
Branch Lifetime
The elapsed time between creating a branch and merging it to trunk. CD requires branch lifetimes
measured in hours, not days or weeks. Long branch lifetimes are a symptom of poor work
decomposition or slow code review. See Trunk-Based Development.
C
Canary Deployment
A deployment strategy where a new version is rolled out to a small subset of users or servers
before full rollout. If the canary shows no issues, the deployment proceeds to 100%. See
Progressive Rollout.
CD (Continuous Delivery)
The practice of ensuring that every change to the codebase is always in a deployable state and
can be released to production at any time through a fully automated pipeline. Continuous
delivery does not require that every change is deployed automatically, but it requires that
every change could be deployed automatically. This is the primary goal of this migration
guide.
Change Failure Rate (CFR)
The percentage of deployments to production that result in a degraded service and require
remediation (e.g., rollback, hotfix, or patch). One of the four DORA metrics. See
Metrics – Change Fail Rate.
CI (Continuous Integration)
The practice of integrating code changes to a shared trunk at least once per day, where each
integration is verified by an automated build and test suite. CI is a prerequisite for CD, not
a synonym. A team that runs automated builds on feature branches but merges weekly is not doing
CI. See Build Automation.
Constraint
In the Theory of Constraints, the single factor most limiting the throughput of a system.
During a CD migration, your job is to find and fix constraints in order of impact. See
Identify Constraints.
Continuous Deployment
An extension of continuous delivery where every change that passes the automated pipeline is
deployed to production without manual intervention. Continuous delivery ensures every change
can be deployed; continuous deployment ensures every change is deployed. See
Phase 4 – Deliver on Demand.
D
Deployable
A change that has passed all automated quality gates defined by the team and is ready for
production deployment. The definition of deployable is codified in the pipeline, not decided
by a person at deployment time. See Deployable Definition.
Deployment Frequency
How often an organization successfully deploys to production. One of the four DORA metrics.
See Metrics – Release Frequency.
Development Cycle Time
The elapsed time from the first commit on a change to that change being deployable. This
measures the efficiency of your development and pipeline process, excluding upstream wait times.
See Metrics – Development Cycle Time.
DORA Metrics
The four key metrics identified by the DORA (DevOps Research and Assessment) research program
as predictive of software delivery performance: deployment frequency, lead time for changes,
change failure rate, and mean time to restore service. See DORA Capabilities.
F
Feature Flag
A mechanism that allows code to be deployed to production with new functionality disabled,
then selectively enabled for specific users, percentages of traffic, or environments. Feature
flags decouple deployment from release. See Feature Flags.
Flow Efficiency
The ratio of active work time to total elapsed time in a delivery process. A flow efficiency of
15% means that for every hour of actual work, roughly 5.7 hours are spent waiting. Value stream
mapping reveals your flow efficiency. See Value Stream Mapping.
H
Hard Dependency
A dependency that must be resolved before work can proceed. In delivery, hard dependencies
include things like waiting for another team’s API, a shared database migration, or an
infrastructure provisioning request. Hard dependencies create queues and increase lead time.
Eliminating hard dependencies is a focus of
Architecture Decoupling.
Hardening Sprint
A sprint dedicated to stabilizing and fixing defects before a release. The existence of
hardening sprints is a strong signal that quality is not being built in during regular
development. Teams practicing CD do not need hardening sprints because every commit is
deployable. See Common Blockers.
I
Immutable Artifact
A build artifact that is never modified after creation. The same artifact that is tested in the
pipeline is the exact artifact that is deployed to production. Configuration differences between
environments are handled externally. See Immutable Artifacts.
Integration Frequency
How often a developer integrates code to the shared trunk. CD requires at least daily
integration. See Metrics – Integration Frequency.
L
Lead Time for Changes
The elapsed time from when a commit is made to when it is successfully running in production.
One of the four DORA metrics. See Metrics – Lead Time.
M
Mean Time to Restore (MTTR)
The elapsed time from when a production incident is detected to when service is restored. One
of the four DORA metrics. Teams practicing CD have short MTTR because deployments are small,
rollback is automated, and the cause of failure is easy to identify. See
Metrics – Mean Time to Repair.
P
Pipeline
The automated sequence of build, test, and deployment stages that every change passes through
on its way to production. See Phase 2 – Pipeline.
Production-Like Environment
A test or staging environment that matches production in configuration, infrastructure, and
data characteristics. Testing in environments that differ from production is a common source
of deployment failures. See Production-Like Environments.
R
Rollback
The ability to revert a production deployment to a previous known-good state. CD requires
automated rollback that takes minutes, not hours. See Rollback.
S
Soft Dependency
A dependency that can be worked around or deferred. Unlike hard dependencies, soft dependencies
do not block work but may influence sequencing or design decisions. Feature flags can turn many
hard dependencies into soft dependencies by allowing incomplete integrations to be deployed in
a disabled state.
Story Points
A relative estimation unit used by some teams to forecast effort. Story points are frequently
misused as a productivity metric, which creates perverse incentives to inflate estimates and
discourages the small work decomposition that CD requires. If your organization uses story
points as a velocity target, see Common Blockers.
T
TBD (Trunk-Based Development)
A source-control branching model where all developers integrate to a single shared branch
(trunk) at least once per day. Short-lived feature branches (less than a day) are acceptable.
Long-lived feature branches are not. TBD is a prerequisite for CI, which is in turn a
prerequisite for CD. See Trunk-Based Development.
TDD (Test-Driven Development)
A development practice where tests are written before the production code that makes them
pass. TDD supports CD by ensuring high test coverage, driving simple design, and producing
a fast, reliable test suite. TDD feeds into the testing fundamentals
required in Phase 1.
Toil
Repetitive, manual work related to maintaining a production service that is automatable, has
no lasting value, and scales linearly with service size. Examples include manual deployments,
manual environment provisioning, and manual test execution. Eliminating toil is a primary
benefit of building a CD pipeline.
U
Unplanned Work
Work that arrives outside the planned backlog – production incidents, urgent bug fixes,
ad hoc requests. High levels of unplanned work indicate systemic quality or operational
problems. Teams with high change failure rates generate their own unplanned work through
failed deployments. Reducing unplanned work is a natural outcome of improving change failure
rate through CD practices.
V
Value Stream Map
A visual representation of every step required to deliver a change from request to production,
showing process time, wait time, and percent complete and accurate at each step. The
foundational tool for Phase 0 – Assess.
Vertical Sliced Story
A user story that delivers a thin slice of functionality across all layers of the system
(UI, API, database, etc.) rather than a horizontal slice that implements one layer completely.
Vertical slices are independently deployable and testable, which is essential for CD. Vertical
slicing is a core technique in Work Decomposition.
W
WIP (Work in Progress)
The number of work items that have been started but not yet completed. High WIP increases lead
time, reduces focus, and increases context-switching overhead. Limiting WIP is a key practice
in Phase 3 – Limiting WIP.
Working Agreement
An explicit, documented set of team norms covering how work is defined, reviewed, tested, and
deployed. Working agreements create shared expectations and reduce friction. See
Working Agreements.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.2 - CD Dependency Tree
Visual guide showing how CD practices depend on and build upon each other.
Continuous delivery is not a single practice you adopt. It is a system of interdependent
practices where each one supports and enables others. This dependency tree shows those
relationships. Understanding the dependencies helps you plan your migration in the right
order – addressing foundational practices before building on them.
The Dependency Tree
The diagram below shows how the core practices of CD relate to each other. Read it from
bottom to top: lower practices enable higher ones. The migration phases in this guide are
sequenced to follow these dependencies.
graph BT
subgraph "Goal"
CD["Continuous Delivery"]
end
subgraph "Continuous Integration"
CI["Continuous Integration"]
end
subgraph "Development Practices"
TBD["Trunk-Based Development"]
TDD["Test-Driven Development"]
BDD["Behavior-Driven Development"]
WD["Work Decomposition"]
CR["Code Review"]
end
subgraph "Build & Test Infrastructure"
BA["Build Automation"]
TS["Test Suite"]
PLEnv["Production-Like Environments"]
end
subgraph "Pipeline Practices"
SPP["Single Path to Production"]
DP["Deterministic Pipeline"]
IA["Immutable Artifacts"]
AC["Application Config"]
RB["Rollback"]
DD["Deployable Definition"]
end
subgraph "Flow Optimization"
SB["Small Batches"]
FF["Feature Flags"]
WIP["WIP Limits"]
MDI["Metrics-Driven Improvement"]
end
subgraph "Organizational Practices"
WA["Working Agreements"]
Retro["Retrospectives"]
AD["Architecture Decoupling"]
end
%% Development Practices feed CI
TDD --> CI
BDD --> TDD
BDD --> WD
TBD --> CI
WD --> SB
CR --> TBD
%% Build infrastructure feeds CI
BA --> CI
TS --> CI
TDD --> TS
%% CI feeds pipeline
CI --> SPP
CI --> DP
PLEnv --> DP
%% Pipeline practices feed CD
SPP --> CD
DP --> CD
IA --> CD
AC --> IA
RB --> CD
DD --> CD
%% Flow optimization feeds CD
SB --> CD
FF --> SB
FF --> CD
WIP --> SB
MDI --> CD
%% Organizational practices support everything
WA --> CR
WA --> DD
Retro --> MDI
AD --> FF
AD --> SB
How to Read the Dependency Tree
Each arrow means “supports” or “enables.” When practice A has an arrow pointing to practice B,
it means A is a prerequisite or enabler for B.
Key dependency chains to understand:
BDD enables TDD enables CI enables CD
Behavior-Driven Development produces clear, testable acceptance criteria. Those criteria drive
Test-Driven Development at the code level. A comprehensive, fast test suite enables
Continuous Integration with confidence. And CI is the foundational prerequisite for CD.
If your team skips BDD, stories are ambiguous. If stories are ambiguous, tests are incomplete or wrong. Incomplete or wrong tests make CI unreliable, and unreliable CI makes CD impossible.
Work Decomposition enables Small Batches enables CD
You cannot deploy small batches if your work items are large. Work decomposition – breaking
features into vertical slices that can each be completed in
two days or less – is what makes small batches possible. Small batches in turn reduce
deployment risk and enable the rapid feedback that CD depends on.
Trunk-Based Development enables CI
CI requires that all developers integrate to a shared trunk at least once per day. If your team
uses long-lived feature branches, you are not doing CI regardless of how often your build server
runs. TBD is not optional for CD – it is a prerequisite.
Architecture Decoupling enables Feature Flags and Small Batches
Tightly coupled architectures force coordinated deployments. When changing service A requires
simultaneously changing services B and C, small independent deployments become impossible.
Architecture decoupling – through well-defined APIs, contract testing, and service boundaries
– enables teams to deploy independently, use feature flags effectively, and maintain small
batch sizes.
Mapping to Migration Phases
The dependency tree directly informs the sequencing of migration phases:
| Dependency Layer | Migration Phase | Why This Order |
|---|---|---|
| Development practices (TBD, TDD, BDD, work decomposition, code review) | Phase 1 – Foundations | These are prerequisites for CI, which is a prerequisite for everything else |
| Build and test infrastructure (build automation, test suite, production-like environments) | Phase 1 and Phase 2 | You need a reliable build and test infrastructure before you can build a reliable pipeline |
| Pipeline practices (single path, deterministic pipeline, immutable artifacts, config, rollback) | Phase 2 – Pipeline | The pipeline depends on solid CI and development practices |
| Flow optimization (small batches, feature flags, WIP limits, metrics) | Phase 3 – Optimize | Optimization requires a working pipeline to optimize |
| Organizational practices (working agreements, retrospectives, architecture decoupling) | All phases | These cross-cutting practices support every phase and should be established early |
Using the Tree to Diagnose Problems
When something in your delivery process is not working, trace it through the dependency tree
to find the root cause.
Example 1: Deployments keep failing.
Look at what feeds CD in the tree. Is your pipeline deterministic? Are you using immutable
artifacts? Is your application config externalized? The failure is likely in one of the
pipeline practices.
Example 2: CI builds are constantly broken.
Look at what feeds CI. Are developers actually practicing TBD (integrating daily)? Is the test
suite reliable, or is it full of flaky tests? Is the build automated end-to-end? The broken
builds are a symptom of a problem in the development practices layer.
Example 3: You cannot reduce batch size.
Look at what feeds small batches. Is work being decomposed into vertical slices? Are feature
flags available so partial work can be deployed safely? Is the architecture decoupled enough
to allow independent deployment? The batch size problem originates in one of these upstream
practices.
Migration Tip
When you encounter a problem, resist the urge to fix the symptom. Use the dependency tree to
trace the problem to its root cause. Fixing the symptom (for example, adding more manual
testing to catch deployment failures) will not solve the underlying issue and often adds
toil that makes things worse. Fix the dependency that is broken, and the downstream problem
resolves itself.
Practices Not Shown
The tree above focuses on the core technical and process practices. Several important
supporting practices are not shown for clarity but are covered elsewhere in this guide:
- Observability and monitoring – essential for progressive rollout and fast incident response
- Security automation – integrated into the pipeline as automated checks rather than manual gates
- Database change management – a common constraint addressed during pipeline architecture
- Team topology and organizational design – addressed through working agreements and architectural decoupling
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.3 - Common Blockers
Frequently encountered obstacles on the path to CD and how to address them.
Every team migrating to continuous delivery will encounter obstacles. Some are technical. Most
are not. The blockers listed here are drawn from patterns observed across hundreds of teams
attempting the journey to CD. Recognizing them early helps you address root causes rather than
fight symptoms.
Work Breakdown Problems
Stories Too Large
What it looks like: User stories regularly take more than a week to complete. Developers
work on a single story for days without integrating. Sprint commitments are frequently missed
because “the story was bigger than we thought.”
Why it blocks CD: Large stories mean large batches. Large batches mean infrequent
integration. Infrequent integration means painful merges, delayed feedback, and high-risk
deployments. You cannot practice continuous integration – the prerequisite for CD – if your
work items take a week.
What to do: Adopt vertical slicing. Every story should deliver a thin slice of user-visible
functionality across all layers of the system. Target a maximum of two days from start to
done. See Work Decomposition.
No Vertical Slicing
What it looks like: Stories are organized by technical layer (“build the API,” “build the
database schema,” “build the UI”) rather than by user-visible behavior. Multiple stories must
be completed before anything is demonstrable or testable end-to-end.
Why it blocks CD: Horizontal slices cannot be independently deployed or tested. They create
hard dependencies between stories and teams. Nothing is deployable until all layers are
assembled, which forces large-batch releases.
What to do: Rewrite stories as vertical slices that deliver end-to-end functionality,
even if the initial slice is minimal. A single form field that saves to the database and
displays a confirmation is a vertical slice. An entire database schema with no UI is not.
Team Workflow Problems
Too Much Work in Progress
What it looks like: Every developer is working on a different story. The team has 8 items
in progress and 0 items done. Standup meetings are long because everyone has a different
context to report on. Nothing is finished, but everything is started.
Why it blocks CD: High WIP destroys flow. When everything is in progress, nothing gets the
focused attention needed to finish. Context switching between items adds overhead. The
delivery pipeline sees sporadic, large commits rather than a steady stream of small ones.
What to do: Set explicit WIP limits. A team of 6 developers should have no more than 3-4
items in progress at any time. The goal is to finish work, not to start it. See
Limiting WIP.
Distant Date Commitments
What it looks like: The team has committed to delivering a specific scope by a date months
in the future. The commitment was made before the work was understood. Progress is tracked
against the original plan, and “falling behind” triggers pressure to cut corners.
Why it blocks CD: Fixed-scope, fixed-date commitments incentivize large batches. Teams
hoard changes until the deadline, then deploy everything at once. There is no incentive to
deliver incrementally because the commitment is about the whole scope, not about continuous
flow. When the deadline pressure mounts, testing is the first thing cut.
What to do: Shift to continuous delivery of small increments. Report progress by showing
working software in production, not by comparing actuals to a Gantt chart. If date commitments
are required by the organization, negotiate on scope rather than on quality.
Velocity Used as a Productivity Metric
What it looks like: Management tracks story points completed per sprint as a measure of
team productivity. Teams are compared by velocity. There is pressure to increase velocity
every sprint.
Why it blocks CD: When velocity is a target, it ceases to be a useful measure (Goodhart’s
Law). Teams inflate estimates to look productive. Stories get larger because larger stories
have more points. The incentive is to maximize points, not to deliver small, frequent, valuable
changes to production.
What to do: Replace velocity with DORA metrics – deployment
frequency, lead time, change failure rate, and mean time to restore. These measure delivery
outcomes rather than output volume.
Manual Testing Gates
Hardening Sprints
What it looks like: The team allocates one or more sprints after “feature complete” to
stabilize, fix bugs, and prepare for release. Code is frozen during hardening. Testers run
manual regression suites. Bug counts are tracked on a burndown chart.
Why it blocks CD: A hardening sprint is an admission that the normal development process
does not produce deployable software. If you need a dedicated period to make code
production-ready, you are not continuously delivering – you are doing waterfall with shorter
phases. Hardening sprints add weeks of delay and encourage teams to accumulate technical debt
during feature sprints because “we’ll fix it in hardening.”
What to do: Eliminate the need for hardening by building quality in. Adopt TDD to ensure
test coverage. Use a CI pipeline that runs the full test suite on every commit. Define
“deployable” as an automated pipeline outcome, not as a manual assessment. See
Testing Fundamentals and
Deployable Definition.
Manual Regression Testing
What it looks like: Every release requires a manual regression test cycle that takes days
or weeks. Testers execute scripted test cases against the application. New features are tested
manually before they are considered done.
Why it blocks CD: Manual regression testing scales linearly with application size and
inversely with delivery frequency. The more features you add, the longer regression takes.
The longer regression takes, the less frequently you can deploy. This is the opposite of CD.
What to do: Automate regression tests. Not all at once – start with the highest-risk
areas and the tests that block deployments most frequently. Your automated test suite should
give you the same confidence as manual regression, but in minutes rather than days. See
Testing Fundamentals.
Organizational Anti-Patterns
Meaningless Retrospectives
What it looks like: Retrospectives happen on schedule, but action items are never
completed. The same problems surface every sprint. The team has stopped believing that
retrospectives lead to change.
Why it blocks CD: CD requires continuous improvement. If the mechanism for identifying and
addressing process problems is broken, systemic issues accumulate. The same blockers will
persist indefinitely.
What to do: Limit retrospective action items to one or two per sprint and track them as
work items with the same visibility as feature work. Make the action items specific and
completable. “Improve testing” is not an action item. “Automate the login flow regression
test” is. See Retrospectives.
Team Instability
What it looks like: Team members are frequently reassigned to other projects. New people
join and leave every few sprints. The team never builds shared context or working agreements.
Why it blocks CD: CD practices depend on team discipline and shared understanding. TBD
requires trust between developers. Code review speed depends on familiarity with the codebase.
Working agreements require a stable group to establish and maintain. Constantly reshuffling
teams means constantly restarting the journey.
What to do: Advocate for stable, long-lived teams. The team should own a product or service
for its full lifecycle, not be assembled for a project and disbanded when it ends.
One Delivery per Sprint
What it looks like: The team delivers to production once per sprint, typically at the end.
All stories from the sprint are bundled into a single release. The “sprint demo” is the first
time stakeholders see working software.
Why it blocks CD: One delivery per sprint is not continuous delivery. It is a two-week batch
release with Agile terminology. If something breaks in the batch, any of the changes could be
the cause. Rollback means losing the entire sprint’s work. Feedback is delayed by weeks.
What to do: Start deploying individual stories as they are completed, not at the end of
the sprint. This requires a working CI pipeline, trunk-based development, and the ability to
deploy independently. These are the outcomes of Phase 1 and
Phase 2.
Anti-Patterns Summary
The table below maps each common blocker to its root cause and the migration phase that
addresses it.
Where to Start
If you recognize many of these blockers in your team, do not try to address them all at once.
Use the CD Dependency Tree to understand which practices are
prerequisite to others, and use your value stream map
to identify which blocker is the current constraint. Fix the biggest constraint first, then
move to the next.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.4 - DORA Capabilities
The capabilities that drive software delivery performance, as identified by DORA research.
The DevOps Research and Assessment (DORA) research program has identified capabilities that
predict high software delivery performance. These capabilities are not tools or technologies
– they are practices and cultural conditions that enable teams to deliver software quickly,
reliably, and sustainably.
This page organizes the DORA capabilities by their relevance to each migration phase. Use it
as a reference to understand which capabilities you are building at each stage of your journey
and which ones to focus on next.
Continuous Delivery Capabilities
These capabilities directly support the mechanics of getting software from commit to production.
They are the primary focus of Phases 1 and 2 of the migration.
Version Control
All production artifacts – application code, test code, infrastructure configuration,
deployment scripts, and database schemas – are stored in version control and can be
reproduced from a single source of truth.
Migration relevance: This is a prerequisite for Phase 1. If any part of your delivery
process depends on files stored on a specific person’s machine or a shared drive, address that
before beginning the migration.
Continuous Integration
Developers integrate their work to trunk at least daily. Each integration triggers an
automated build and test process. Broken builds are fixed within minutes.
Migration relevance: Phase 1 – Foundations. CI is the gateway
capability. Without it, none of the pipeline practices in Phase 2 can function. See
Build Automation and
Trunk-Based Development.
Deployment Automation
Deployments are fully automated and can be triggered by anyone on the team. No manual steps
are required between a green pipeline and production.
Migration relevance: Phase 2 – Pipeline. Specifically,
Single Path to Production and
Rollback.
Trunk-Based Development
Developers work in small batches and merge to trunk at least daily. Branches, if used, are
short-lived (less than one day). There are no long-lived feature branches.
Migration relevance: Phase 1 – Trunk-Based Development.
This is one of the first capabilities to establish because it enables CI.
Test Automation
A comprehensive suite of automated tests provides confidence that the software is deployable.
Tests are reliable, fast, and maintained as carefully as production code.
Migration relevance: Phase 1 – Testing Fundamentals.
Also see the Testing reference section for guidance on specific test types.
Test Data Management
Test data is managed in a way that allows automated tests to run independently, repeatably,
and without relying on shared mutable state. Tests can create and clean up their own data.
Migration relevance: Becomes critical during Phase 2 when you need
production-like environments and deterministic pipeline results.
Shift Left on Security
Security is integrated into the development process rather than added as a gate at the end.
Automated security checks run in the pipeline. Security requirements are part of the
definition of deployable.
Migration relevance: Integrated during Phase 2 – Pipeline Architecture
as automated quality gates rather than manual review steps.
Architecture Capabilities
These capabilities address the structural characteristics of your system that enable or prevent
independent, frequent deployment.
Loosely Coupled Architecture
Teams can deploy their services independently without coordinating with other teams. Changes
to one service do not require changes to other services. APIs have well-defined contracts.
Migration relevance: Phase 3 – Architecture Decoupling.
This capability becomes critical when optimizing for deployment frequency and small batch
sizes.
Empowered Teams
Teams choose their own tools, technologies, and approaches within organizational guardrails.
They do not need approval from a central architecture board for implementation decisions.
Migration relevance: All phases. Teams that cannot make local decisions about their
pipeline, test strategy, or deployment approach will be unable to iterate quickly enough
to make progress.
Product and Process Capabilities
These capabilities address how work is planned, prioritized, and delivered.
Customer Feedback
Product decisions are informed by direct feedback from customers. Teams can observe how
features are used in production and adjust accordingly.
Migration relevance: Becomes fully enabled in Phase 4 – Deliver on Demand
when every change reaches production quickly enough for real customer feedback to inform
the next change.
Value Stream Visibility
The team has a clear view of the entire delivery process from request to production, including
wait times, handoffs, and rework loops.
Migration relevance: Phase 0 – Value Stream Mapping.
This is the first activity in the migration because it informs every decision that follows.
Working in Small Batches
Work is broken down into small increments that can be completed, tested, and deployed
independently. Each increment delivers measurable value or validated learning.
Migration relevance: Begins in Phase 1 – Work Decomposition
and is optimized in Phase 3 – Small Batches.
Team Experimentation
Teams can try new ideas, tools, and approaches without requiring approval through a lengthy
process. Failed experiments are treated as learning, not as waste.
Migration relevance: All phases. The migration itself is an experiment. Teams need the
psychological safety and organizational support to try new practices, fail occasionally, and
adjust.
Lean Management Capabilities
These capabilities address how work is managed, measured, and improved.
Limit Work in Progress
Teams have explicit WIP limits that constrain the number of items in any stage of the delivery
process. WIP limits are enforced and respected.
Migration relevance: Phase 3 – Limiting WIP. Reducing WIP
is one of the most effective ways to improve lead time and delivery predictability.
Visual Management
The state of all work is visible to the entire team through dashboards, boards, or other
visual tools. Anyone can see what is in progress, what is blocked, and what has been deployed.
Migration relevance: All phases. Visual management supports the identification of
constraints in Phase 0 and the enforcement of WIP limits in Phase 3.
Monitoring and Observability
Teams have access to production metrics, logs, and traces that allow them to understand system
behavior, detect issues, and diagnose problems quickly.
Migration relevance: Critical for Phase 4 – Progressive Rollout
where automated health checks determine whether a deployment proceeds or rolls back. Also
supports fast mean time to restore.
Proactive Notification
Teams are alerted to problems before customers are affected. Monitoring thresholds and
anomaly detection trigger notifications that enable rapid response.
Migration relevance: Becomes critical in Phase 4 when deployments are continuous and
automated. Proactive notification is what makes continuous deployment safe.
Cultural Capabilities
These capabilities address the human and organizational conditions that enable high performance.
Generative Culture
Following Ron Westrum’s organizational typology, a generative culture is characterized by
high cooperation, shared risk, and a focus on the mission. Messengers are not punished.
Failures are treated as learning opportunities. New ideas are welcomed.
Migration relevance: All phases. A generative culture is not a phase you implement – it
is a condition you cultivate continuously. Teams in pathological or bureaucratic cultures will
struggle with every phase of the migration because practices like TBD and CI require trust
and psychological safety.
Learning Culture
The organization invests in learning. Teams have time for experimentation, training, and
conference attendance. Knowledge is shared across teams.
Migration relevance: All phases. The CD migration is a learning journey. Teams need time
and space to learn new practices, make mistakes, and improve.
Collaboration Among Teams
Development, operations, security, and product teams work together rather than in silos.
Handoffs are minimized. Shared responsibility replaces blame.
Migration relevance: All phases, but especially Phase 2 – Pipeline
where the pipeline must encode the quality criteria from all disciplines (security, testing,
operations) into automated gates.
Job Satisfaction
Team members find their work meaningful and have the autonomy and resources to do it well.
High job satisfaction predicts high delivery performance (the relationship is bidirectional).
Migration relevance: The migration itself should improve job satisfaction by reducing
toil, eliminating painful manual processes, and giving teams faster feedback on their work.
If the migration is experienced as a burden rather than an improvement, something is wrong
with the approach.
Transformational Leadership
Leaders support the migration with vision, resources, and organizational air cover. They
remove impediments, set direction, and create the conditions for teams to succeed without
micromanaging the details.
Migration relevance: All phases. Without leadership support, the migration will stall
when it encounters the first organizational blocker (budget for tools, policy changes for
deployment processes, cross-team coordination).
Capability Maturity by Phase
The following table maps each DORA capability to the migration phase where it is most actively
developed:
| Capability | Phase 0 | Phase 1 | Phase 2 | Phase 3 | Phase 4 |
|---|---|---|---|---|---|
| Version control | Prerequisite | | | | |
| Continuous integration | | Primary | | | |
| Deployment automation | | | Primary | | |
| Trunk-based development | | Primary | | | |
| Test automation | | Primary | Expanded | | |
| Test data management | | | Primary | | |
| Shift left on security | | | Primary | | |
| Loosely coupled architecture | | | | Primary | |
| Empowered teams | Ongoing | Ongoing | Ongoing | Ongoing | Ongoing |
| Customer feedback | | | | | Primary |
| Value stream visibility | Primary | | | Revisited | |
| Working in small batches | | Started | | Primary | |
| Team experimentation | Ongoing | Ongoing | Ongoing | Ongoing | Ongoing |
| Limit WIP | | | | Primary | |
| Visual management | Started | Ongoing | Ongoing | Ongoing | Ongoing |
| Monitoring and observability | | | Started | Expanded | Primary |
| Proactive notification | | | | | Primary |
| Generative culture | Ongoing | Ongoing | Ongoing | Ongoing | Ongoing |
| Learning culture | Ongoing | Ongoing | Ongoing | Ongoing | Ongoing |
| Collaboration among teams | | Started | Primary | | |
| Job satisfaction | Ongoing | Ongoing | Ongoing | Ongoing | Ongoing |
| Transformational leadership | Ongoing | Ongoing | Ongoing | Ongoing | Ongoing |
Using This Table
“Primary” means the phase where the capability is the main focus of improvement work.
“Ongoing” means the capability is relevant in every phase and should be continuously
nurtured. “Started” or “Expanded” means the capability is introduced or deepened in that
phase. No entry means the capability is not a primary concern in that phase, though it may
still be relevant.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.5 - Resources
Books, videos, and further reading on continuous delivery and deployment.
Adapted from MinimumCD.org
This page collects the books, websites, and videos that inform the practices in this migration
guide. Resources are organized by topic and annotated with which migration phase they are most
relevant to.
Books
Continuous Delivery and Deployment
- Continuous Delivery Pipelines by Dave Farley
- A practical, focused guide to building CD pipelines. Farley covers pipeline design, testing
strategies, and deployment patterns in a direct, implementation-oriented style. Start here
if you want a concise guide to the pipeline practices in Phase 2.
- Most relevant to: Phase 2 – Pipeline
- Continuous Delivery by Jez Humble and Dave Farley
- The foundational text on CD. Published in 2010, it remains the most comprehensive treatment
of the principles and practices that make continuous delivery work. Covers version control
patterns, build automation, testing strategies, deployment pipelines, and release management.
If you read one book before starting your migration, read this one.
- Most relevant to: All phases
- Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim
- Presents the DORA research findings that link technical practices to organizational
performance. Covers the four key metrics (deployment frequency, lead time, change failure
rate, MTTR) and the capabilities that predict high performance. Essential reading for anyone
who needs to make the business case for a CD migration.
- Most relevant to: Phase 0 – Assess and Phase 3 – Metrics-Driven Improvement
- Engineering the Digital Transformation by Gary Gruver
- Addresses the organizational and leadership challenges of large-scale delivery
transformation. Gruver draws on his experience leading transformations at HP and other large
enterprises. Particularly valuable for leaders sponsoring a migration who need to understand
the change management, communication, and sequencing challenges ahead.
- Most relevant to: Organizational leadership across all phases
- Release It! by Michael T. Nygard
- Covers the design and architecture patterns that make production systems resilient. Topics
include stability patterns (circuit breakers, bulkheads, timeouts), deployment patterns, and
the operational realities of running software at scale. Essential reading before entering
Phase 4, where the team has the capability to deploy any change on demand.
- Most relevant to: Phase 4 – Deliver on Demand and Phase 2 – Rollback
- The DevOps Handbook by Gene Kim, Jez Humble, Patrick Debois, and John Willis
- A practical companion to The Phoenix Project. Covers the Three Ways (flow, feedback, and
continuous learning) and provides detailed guidance on implementing DevOps practices. Useful
as a reference throughout the migration.
- Most relevant to: All phases
- The Phoenix Project by Gene Kim, Kevin Behr, and George Spafford
- A novel that illustrates DevOps principles through the story of a fictional IT organization
in crisis. Useful for building organizational understanding of why delivery improvement
matters, especially for stakeholders who will not read a technical book.
- Most relevant to: Building organizational buy-in during Phase 0
Testing
- Growing Object-Oriented Software, Guided by Tests by Steve Freeman and Nat Pryce
- The definitive guide to test-driven development in practice. Goes beyond unit testing to
cover acceptance testing, test doubles, and how TDD drives design. Essential reading for
Phase 1 testing fundamentals.
- Most relevant to: Phase 1 – Testing Fundamentals
- Working Effectively with Legacy Code by Michael Feathers
- Practical techniques for adding tests to untested code, breaking dependencies, and
incrementally improving code that was not designed for testability. Indispensable if your
migration starts with a codebase that has little or no automated testing.
- Most relevant to: Phase 1 – Testing Fundamentals
Work Decomposition and Flow
- User Story Mapping by Jeff Patton
- A practical guide to breaking features into deliverable increments using story maps. Patton’s
approach directly supports the vertical slicing discipline required for small batch delivery.
- Most relevant to: Phase 1 – Work Decomposition
- The Principles of Product Development Flow by Donald Reinertsen
- A rigorous treatment of flow economics in product development. Covers queue theory, batch
size economics, WIP limits, and the cost of delay. Dense but transformative. Reading this
book will change how you think about every aspect of your delivery process.
- Most relevant to: Phase 3 – Optimize
- Making Work Visible by Dominica DeGrandis
- Focuses on identifying and eliminating the “time thieves” that steal productivity: too much
WIP, unknown dependencies, unplanned work, conflicting priorities, and neglected work. A
practical companion to the WIP limiting practices in Phase 3.
- Most relevant to: Phase 3 – Limiting WIP
Architecture
- Building Microservices by Sam Newman
- Covers the architectural patterns that enable independent deployment, including service
boundaries, API design, data management, and testing strategies for distributed systems.
- Most relevant to: Phase 3 – Architecture Decoupling
- Team Topologies by Matthew Skelton and Manuel Pais
- Addresses the relationship between team structure and software architecture (Conway’s Law in
practice). Covers team types, interaction modes, and how to evolve team structures to support
fast flow. Valuable for addressing the organizational blockers that surface throughout the
migration.
- Most relevant to: Organizational design across all phases
Websites
- MinimumCD.org
- Defines the minimum set of practices required to claim you are doing continuous delivery.
This migration guide uses the MinimumCD definition as its target state. Start here to
understand what CD actually requires.
- Dojo Consortium
- A community-maintained collection of CD practices, metrics definitions, and improvement
patterns. Many of the definitions and frameworks in this guide are adapted from the Dojo
Consortium’s work.
- DORA (dora.dev)
- The DevOps Research and Assessment site, which publishes the annual State of DevOps report
and provides resources for measuring and improving delivery performance.
- Trunk-Based Development
- The comprehensive reference for trunk-based development patterns. Covers short-lived
feature branches, feature flags, branch by abstraction, and release branching strategies.
- Martin Fowler’s blog (martinfowler.com)
- Martin Fowler’s site contains authoritative articles on continuous integration, continuous
delivery, microservices, refactoring, and software design. Key articles include
“Continuous Integration” and “Continuous Delivery.”
- Google Cloud Architecture Center – DevOps
- Google’s public documentation of the DORA capabilities, including self-assessment tools and
implementation guidance.
Videos
- “Continuous Delivery” by Dave Farley (YouTube channel)
- Dave Farley’s YouTube channel provides weekly videos covering CD practices, pipeline design,
testing strategies, and software engineering principles. Accessible and practical.
- Most relevant to: All phases
- “Continuous Delivery” by Jez Humble (various conference talks)
- Jez Humble’s conference presentations cover the principles and research behind CD. His talk
“Why Continuous Delivery?” is an excellent introduction for teams and stakeholders who are
new to the concept.
- Most relevant to: Building understanding during Phase 0
- “Refactoring” and “TDD” talks by Martin Fowler and Kent Beck
- Foundational talks on the development practices that support CD. Understanding TDD and
refactoring is essential for Phase 1 testing fundamentals.
- Most relevant to: Phase 1 – Foundations
- “The Smallest Thing That Could Possibly Work” by Bryan Finster
- Covers the work decomposition and small batch delivery practices that are central to this
migration guide. Focuses on practical techniques for breaking work into vertical slices.
- Most relevant to: Phase 1 – Work Decomposition and Phase 3 – Small Batches
Recommended Reading Order
If you are starting your migration and want to read in the most useful order:
- Accelerate – to understand the research and build the business case
- Continuous Delivery (Humble & Farley) – to understand the full picture
- Continuous Delivery Pipelines (Farley) – for practical pipeline implementation
- Working Effectively with Legacy Code – if your codebase lacks tests
- The Principles of Product Development Flow – to understand flow optimization
- Release It! – before moving to continuous deployment
Migration Tip
You do not need to read all of these before starting your migration. Start with the practices
in Phase 1, read Accelerate for the business case, and refer to the other resources as you
reach the relevant migration phase. The most important thing is to start delivering
improvements, not to finish a reading list.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
8.6 - Metrics
Detailed definitions for key delivery metrics. Understand what to measure and why.
These metrics help you assess your current delivery performance and track improvement
over time. Start with the metrics most relevant to your current phase.
Key Metrics
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.6.1 - Integration Frequency
How often developers integrate code changes to the trunk – a leading indicator of CI maturity and small batch delivery.
Definition
Integration Frequency measures the average number of production-ready pull requests
a team merges to trunk per day, normalized by team size. On a team of five
developers, healthy continuous integration practice produces at least five
integrations per day – roughly one per developer.
This metric is a direct indicator of how well a team practices
Continuous Integration.
Teams that integrate frequently work in small batches, receive fast feedback, and
reduce the risk associated with large, infrequent merges.
A value of 1.0 or higher per developer per day indicates that work is being
decomposed into small, independently deliverable increments.
How to Measure
- Count trunk merges. Track the number of pull requests (or direct commits)
merged to
main or trunk each day.
- Normalize by team size. Divide the daily count by the number of developers
actively contributing that day.
- Calculate the rolling average. Use a 5-day or 10-day rolling window to
smooth daily variation and surface meaningful trends.
Most source control platforms expose this data through their APIs:
- GitHub – list merged pull requests via the REST or GraphQL API.
- GitLab – query merged merge requests per project.
- Bitbucket – use the pull request activity endpoint.
Alternatively, count commits to the default branch if pull requests are not used.
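The counting and normalization steps above can be scripted. A minimal sketch, assuming a GitHub-hosted repository and Node 18+ (the repository coordinates, team size, and window are placeholders, not part of the source guidance):

```javascript
// Count PRs merged to trunk per day and normalize by team size.
// OWNER, REPO, TEAM_SIZE, and DAYS are placeholders for your own values.
const OWNER = "my-org";
const REPO = "my-service";
const TEAM_SIZE = 5;
const DAYS = 10;

async function mergedPullRequests() {
  const url = `https://api.github.com/repos/${OWNER}/${REPO}/pulls` +
    `?state=closed&base=main&per_page=100`;
  const res = await fetch(url, {
    headers: { Authorization: `Bearer ${process.env.GITHUB_TOKEN}` },
  });
  const pulls = await res.json();
  // Keep only PRs that were actually merged, not merely closed.
  return pulls.filter((pr) => pr.merged_at !== null);
}

(async () => {
  const since = Date.now() - DAYS * 24 * 60 * 60 * 1000;
  const recent = (await mergedPullRequests()).filter(
    (pr) => new Date(pr.merged_at).getTime() >= since
  );
  const perDay = recent.length / DAYS;
  console.log(`Integrations per day: ${perDay.toFixed(2)}`);
  console.log(`Per developer per day: ${(perDay / TEAM_SIZE).toFixed(2)} (target: >= 1.0)`);
})();
```

The sketch uses a fixed team size for simplicity; normalizing by the contributors active on a given day, as step 2 suggests, would also require commit author data.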
Targets
| Level |
Integration Frequency (per developer per day) |
| Low |
Less than 1 per week |
| Medium |
A few times per week |
| High |
Once per day |
| Elite |
Multiple times per day |
The elite target aligns with trunk-based development, where developers push small
changes to the trunk multiple times daily and rely on automated testing and feature
flags to manage risk.
Common Pitfalls
- Meaningless commits. Teams may inflate the count by integrating trivial or
empty changes. Pair this metric with code review quality and defect rate.
- Breaking the trunk. Pushing faster without adequate test coverage leads to a
red build and slows the entire team. Always pair Integration Frequency with build
success rate and Change Fail Rate.
- Counting the wrong thing. Merges to long-lived feature branches do not count.
Only merges to the trunk or main integration branch reflect true CI practice.
- Ignoring quality. If defect rates rise as integration
frequency increases, the team is skipping quality steps. Use defect rate as a
guardrail metric.
Connection to CD
Integration Frequency is the foundational metric for Continuous Delivery. Without
frequent integration, every downstream metric suffers:
- Smaller batches reduce risk. Each integration carries less change, making
failures easier to diagnose and fix.
- Faster feedback loops. Frequent integration means the CI pipeline runs more
often, catching issues within minutes instead of days.
- Enables trunk-based development. High integration frequency is incompatible
with long-lived branches. Teams naturally move toward short-lived branches or
direct trunk commits.
- Reduces merge conflicts. The longer code stays on a branch, the more likely
it diverges from trunk. Frequent integration keeps the delta small.
- Prerequisite for deployment frequency. You cannot deploy more often than you
integrate. Improving this metric directly unblocks improvements to
Release Frequency.
To improve Integration Frequency:
- Decompose stories into smaller increments using
Behavior-Driven Development.
- Use Test-Driven Development to produce modular, independently testable code.
- Adopt feature flags or branch by abstraction to decouple integration from release.
- Practice Trunk-Based Development with
short-lived branches lasting less than one day.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.6.2 - Build Duration
Time from code commit to a deployable artifact – a critical constraint on feedback speed and mean time to repair.
Definition
Build Duration measures the elapsed time from when a developer pushes a commit
until the CI pipeline produces a deployable artifact and all automated quality
gates have passed. This includes compilation, unit tests, integration tests, static
analysis, security scans, and artifact packaging.
Build Duration represents the minimum possible time between deciding to make a
change and having that change ready for production. It sets a hard floor on
Lead Time and directly constrains how quickly a team can
respond to production incidents.
This metric is sometimes referred to as “pipeline cycle time” or “CI cycle time.”
The book Accelerate references it as part of “hard lead time.”
How to Measure
- Record the commit timestamp. Capture when the commit arrives at the CI
server (webhook receipt or pipeline trigger time).
- Record the artifact-ready timestamp. Capture when the final pipeline stage
completes successfully and the deployable artifact is published.
- Calculate the difference. Subtract the commit timestamp from the
artifact-ready timestamp.
- Track the median and p95. The median shows typical performance. The 95th
percentile reveals worst-case builds that block developers.
Most CI platforms expose build duration natively:
- GitHub Actions –
createdAt and updatedAt on workflow runs.
- GitLab CI – pipeline
created_at and finished_at.
- Jenkins – build start time and duration fields.
- CircleCI – workflow duration in the Insights dashboard.
Set up alerts when builds exceed your target threshold so the team can investigate
regressions immediately.
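A minimal sketch of steps 3 and 4, assuming GitHub Actions and Node 18+ (the repository coordinates are placeholders, and the percentile uses a simple nearest-rank method):

```javascript
// Compute median and p95 wall-clock build duration from recent workflow runs.
// Uses created_at (queued) and updated_at (finished) so queue time is included.
const OWNER = "my-org";
const REPO = "my-service";

function percentile(sortedValues, p) {
  // Nearest-rank percentile on an ascending-sorted array.
  const idx = Math.ceil((p / 100) * sortedValues.length) - 1;
  return sortedValues[Math.max(0, idx)];
}

(async () => {
  const url = `https://api.github.com/repos/${OWNER}/${REPO}/actions/runs` +
    `?status=success&per_page=100`;
  const res = await fetch(url, {
    headers: { Authorization: `Bearer ${process.env.GITHUB_TOKEN}` },
  });
  const { workflow_runs } = await res.json();

  const minutes = workflow_runs
    .map((run) => (new Date(run.updated_at) - new Date(run.created_at)) / 60000)
    .sort((a, b) => a - b);

  console.log(`Median build: ${percentile(minutes, 50).toFixed(1)} min`);
  console.log(`p95 build:    ${percentile(minutes, 95).toFixed(1)} min`);
})();
```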
Targets
| Level | Build Duration |
|---|---|
| Low | More than 30 minutes |
| Medium | 10 – 30 minutes |
| High | 5 – 10 minutes |
| Elite | Less than 5 minutes |
The ten-minute threshold is a widely recognized guideline. Builds longer than ten
minutes break developer flow, discourage frequent integration, and increase the
cost of fixing failures.
Common Pitfalls
- Removing tests to hit targets. Reducing test count or skipping test types
(integration, security) lowers build duration but degrades quality. Always pair
this metric with Change Fail Rate and defect rate.
- Ignoring queue time. If builds wait in a queue before execution, the
developer experiences the queue time as part of the feedback delay even though it
is not technically “build” time. Measure wall-clock time from commit to result.
- Optimizing the wrong stage. Profile the pipeline before optimizing. Often a
single slow test suite or a sequential step that could run in parallel dominates
the total duration.
- Flaky tests. Tests that intermittently fail cause retries, effectively
doubling or tripling build duration. Track flake rate alongside build duration.
Connection to CD
Build Duration is a critical bottleneck in the Continuous Delivery pipeline:
- Constrains Mean Time to Repair. When production is down, the build pipeline
is the minimum time to get a fix deployed. A 30-minute build means at least 30
minutes of downtime for any fix, no matter how small. Reducing build duration
directly improves MTTR.
- Enables frequent integration. Developers are unlikely to integrate multiple
times per day if each integration takes 30 minutes to validate. Short builds
encourage higher Integration Frequency.
- Shortens feedback loops. The sooner a developer learns that a change broke
something, the less context they have lost and the cheaper the fix. Builds under
ten minutes keep developers in flow.
- Supports continuous deployment. Automated deployment pipelines cannot deliver
changes rapidly if the build stage is slow. Build duration is often the largest
component of Lead Time.
To improve Build Duration:
- Parallelize stages. Run unit tests, linting, and security scans concurrently
rather than sequentially.
- Replace slow end-to-end tests. Move heavyweight end-to-end tests to an
asynchronous post-deploy verification stage. Use contract tests and service
virtualization in the main pipeline.
- Decompose large services. Smaller codebases compile and test faster. If build
duration is stubbornly high, consider breaking the service into smaller domains.
- Cache aggressively. Cache dependencies, Docker layers, and compilation
artifacts between builds.
- Set a build time budget. Alert the team whenever a new test or step pushes
the build past your target, so test efficiency is continuously maintained.
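As an illustration of the parallelization and caching points above, a minimal GitHub Actions sketch (the job names, commands, and npm toolchain are placeholders, not prescriptions):

```yaml
# Unit tests, lint, and security scan run as independent parallel jobs; each
# restores a cached dependency install to avoid repeating the slowest setup step.
name: ci
on: push

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm        # caches ~/.npm between runs
      - run: npm ci
      - run: npm test

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npm run lint

  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm audit --audit-level=high
```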
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.6.3 - Development Cycle Time
Average time from when work starts until it is running in production – a key flow metric for identifying delivery bottlenecks.
Definition
Development Cycle Time measures the elapsed time from when a developer begins work
on a story or task until that work is deployed to production and available to users.
It captures the full construction phase of delivery: coding, code review, testing,
integration, and deployment.
This is distinct from Lead Time, which includes the time a request
spends waiting in the backlog before work begins. Development Cycle Time focuses
exclusively on the active delivery phase.
The Accelerate research uses “lead time for changes” (measured from commit to
production) as a key DORA metric. Development Cycle Time extends this slightly
further back to when work starts, capturing the full development process including
any time between starting work and the first commit.
How to Measure
- Record when work starts. Capture the timestamp when a story moves to
“In Progress” in your issue tracker, or when the first commit for the story
appears.
- Record when work reaches production. Capture the timestamp of the
production deployment that includes the completed story.
- Calculate the difference. Subtract the start time from the production
deploy time.
- Report the median and distribution. The median provides a typical value.
The distribution (or a control chart) reveals variability and outliers that
indicate process problems.
Sources for this data include:
- Issue trackers (Jira, GitHub Issues, Azure Boards) – status transition
timestamps.
- Source control – first commit timestamp associated with a story.
- Deployment logs – timestamp of production deployments linked to stories.
Linking stories to deployments is essential. Use commit message conventions (e.g.,
story IDs in commit messages) or deployment metadata to create this connection.
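A minimal sketch of that linking step, assuming story IDs such as ABC-123 appear in commit messages and start times come from your tracker (the ID pattern and data shapes are illustrative):

```javascript
// Extract story IDs from the commit messages included in a deployment,
// then compute cycle time as (deploy time - story start time).
const STORY_ID = /\b[A-Z]{2,}-\d+\b/g; // e.g. "ABC-123"; adjust to your tracker

function cycleTimesForDeploy(deploy, storyStartTimes) {
  // deploy: { deployedAt: ISO string, commitMessages: [string, ...] }
  // storyStartTimes: { "ABC-123": ISO string, ... } exported from the tracker
  const ids = new Set(
    deploy.commitMessages.flatMap((msg) => msg.match(STORY_ID) ?? [])
  );
  const deployedAt = new Date(deploy.deployedAt);
  return [...ids]
    .filter((id) => storyStartTimes[id])
    .map((id) => ({
      story: id,
      days: (deployedAt - new Date(storyStartTimes[id])) / 86_400_000,
    }));
}

// Example usage with made-up data:
console.log(
  cycleTimesForDeploy(
    {
      deployedAt: "2024-05-03T16:30:00Z",
      commitMessages: ["ABC-101 add price rules", "ABC-102 fix rounding"],
    },
    { "ABC-101": "2024-05-02T09:00:00Z", "ABC-102": "2024-05-01T10:00:00Z" }
  )
);
```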
Targets
| Level | Development Cycle Time |
|---|---|
| Low | More than 2 weeks |
| Medium | 1 – 2 weeks |
| High | 2 – 7 days |
| Elite | Less than 2 days |
Elite teams deliver completed work to production within one to two days of starting
it. This is achievable only when work is decomposed into small increments, the
pipeline is fast, and deployment is automated.
Common Pitfalls
- Marking work “Done” before it reaches production. If “Done” means “code
complete” rather than “deployed,” the metric understates actual cycle time. The
Definition of Done must include production deployment.
- Skipping the backlog. Moving items from “Backlog” directly to “Done” after
deploying hides the true wait time and development duration. Ensure stories pass
through the standard workflow stages.
- Splitting work into functional tasks. Breaking a story into separate
“development,” “testing,” and “deployment” tasks obscures the end-to-end cycle
time. Measure at the story or feature level.
- Ignoring variability. A low average can hide a bimodal distribution where
some stories take hours and others take weeks. Use a control chart or histogram
to expose the full picture.
- Optimizing for speed without quality. If cycle time drops but
Change Fail Rate rises, the team is cutting corners.
Use quality metrics as guardrails.
Connection to CD
Development Cycle Time is the most comprehensive measure of delivery flow and sits
at the heart of Continuous Delivery:
- Exposes bottlenecks. A long cycle time reveals where work gets stuck –
waiting for code review, queued for testing, blocked by a manual approval, or
delayed by a slow pipeline. Each bottleneck is a target for improvement.
- Drives smaller batches. The only way to achieve a cycle time under two days
is to decompose work into very small increments. This naturally leads to smaller
changes, less risk, and faster feedback.
- Reduces waste from changing priorities. Long cycle times mean work in progress
is exposed to priority changes, context switches, and scope creep. Shorter cycles
reduce the window of vulnerability.
- Improves feedback quality. The sooner a change reaches production, the sooner
the team gets real user feedback. Short cycle times enable rapid learning and
course correction.
- Subsumes other metrics. Cycle time is affected by Integration
Frequency, Build Duration,
and Work in Progress. Improving any of these upstream
metrics will reduce cycle time.
To improve Development Cycle Time:
- Decompose work into stories that can be completed and deployed within one to two
days.
- Remove handoffs between teams (e.g., separate dev and QA teams).
- Automate the build and deploy pipeline to eliminate manual steps.
- Improve test design so the pipeline runs faster without sacrificing coverage.
- Limit Work in Progress so the team focuses on finishing
work rather than starting new items.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.6.4 - Lead Time
Total time from when a change is committed until it is running in production – a DORA key metric for delivery throughput.
Definition
Lead Time measures the total elapsed time from when a code change is committed to
the version control system until that change is successfully running in production.
This is one of the four key metrics identified by the DORA (DevOps Research and
Assessment) team as a predictor of software delivery performance.
In the broader value stream, “lead time” can also refer to the time from a customer
request to delivery. The DORA definition focuses specifically on the segment from
commit to production, which the Accelerate research calls “lead time for changes.”
This narrower definition captures the efficiency of your delivery pipeline and
deployment process.
Lead Time includes Build Duration plus any additional time
for deployment, approval gates, environment provisioning, and post-deploy
verification. It is a superset of build time and a subset of
Development Cycle Time, which also includes the
coding phase before the first commit.
How to Measure
- Record the commit timestamp. Use the timestamp of the commit as recorded in
source control (not the local author timestamp, but the time it was pushed or
merged to the trunk).
- Record the production deployment timestamp. Capture when the deployment
containing that commit completes successfully in production.
- Calculate the difference. Subtract the commit time from the deploy time.
- Aggregate across commits. Report the median lead time across all commits
deployed in a given period (daily, weekly, or per release).
Data sources:
- Source control – commit or merge timestamps from Git, GitHub, GitLab, etc.
- CI/CD platform – pipeline completion times from Jenkins, GitHub Actions,
GitLab CI, etc.
- Deployment tooling – production deployment timestamps from Argo CD, Spinnaker,
Flux, or custom scripts.
For teams practicing continuous deployment, lead time may be nearly identical to
build duration. For teams with manual approval gates or scheduled release windows,
lead time will be significantly longer.
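A minimal sketch, assuming each production deployment is marked with a git tag so the commits it contains can be listed (the tag names and deploy timestamp are placeholders):

```javascript
// Median lead time for one deployment: time from each commit on trunk to the
// production deploy that first contained it.
const { execSync } = require("node:child_process");

function leadTimesForDeploy(prevTag, tag, deployFinishedAt) {
  // %cI = committer date, strict ISO 8601 (approximates merge-to-trunk time).
  const out = execSync(`git log --pretty=%cI ${prevTag}..${tag}`, {
    encoding: "utf8",
  });
  const deployed = new Date(deployFinishedAt);
  return out
    .trim()
    .split("\n")
    .filter(Boolean)
    .map((iso) => (deployed - new Date(iso)) / 3_600_000); // hours
}

function median(values) {
  const s = [...values].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Example: placeholder tags and deploy timestamp.
const hours = leadTimesForDeploy("v1.4.0", "v1.4.1", "2024-05-03T17:05:00Z");
console.log(`Median lead time: ${median(hours).toFixed(1)} h across ${hours.length} commits`);
```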
Targets
| Level | Lead Time for Changes |
|---|---|
| Low | More than 6 months |
| Medium | 1 – 6 months |
| High | 1 day – 1 week |
| Elite | Less than 1 hour |
These levels are drawn from the DORA State of DevOps research. Elite performers
deliver changes to production in under an hour from commit, enabled by fully
automated pipelines and continuous deployment.
Common Pitfalls
- Measuring only build time. Lead time includes everything after the commit,
not just the CI pipeline. Manual approval gates, scheduled deployment windows,
and environment provisioning delays must all be included.
- Ignoring waiting time. A change may sit in a queue waiting for a release
train, a change advisory board (CAB) review, or a deployment window. This wait
time is part of lead time and often dominates the total.
- Tracking requests instead of commits. Some teams measure from customer request
to delivery. While valuable, this conflates backlog prioritization with delivery
efficiency. Keep this metric focused on the commit-to-production segment.
- Hiding items from the backlog. Requests tracked in spreadsheets or side
channels before entering the backlog distort lead time measurements. Ensure all
work enters the system of record promptly.
- Reducing quality to reduce lead time. Shortening approval processes or
skipping test stages reduces lead time at the cost of quality. Pair this metric
with Change Fail Rate as a guardrail.
Connection to CD
Lead Time is one of the four DORA metrics and a direct measure of your delivery
pipeline’s end-to-end efficiency:
- Reveals pipeline bottlenecks. A large gap between build duration and lead time
points to manual processes, approval queues, or deployment delays that the team
can target for automation.
- Measures the cost of failure recovery. When production breaks, lead time is
the minimum time to deliver a fix (unless you roll back). This makes lead time
a direct input to Mean Time to Repair.
- Drives automation. The primary way to reduce lead time is to automate every
step between commit and production: build, test, security scanning, environment
provisioning, deployment, and verification.
- Reflects deployment strategy. Teams using continuous deployment have lead
times measured in minutes. Teams using weekly release trains have lead times
measured in days. The metric makes the cost of batching visible.
- Connects speed and stability. The DORA research shows that elite performers
achieve both low lead time and low Change Fail Rate.
Speed and quality are not trade-offs – they reinforce each other when the
delivery system is well-designed.
To improve Lead Time:
- Automate the deployment pipeline end to end, eliminating manual gates.
- Replace change advisory board (CAB) reviews with automated policy checks and
peer review.
- Deploy on every successful build rather than batching changes into release trains.
- Reduce Build Duration to shrink the largest component of
lead time.
- Monitor and eliminate environment provisioning delays.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.6.5 - Change Fail Rate
Percentage of production deployments that cause a failure or require remediation – a DORA key metric for delivery stability.
Definition
Change Fail Rate measures the percentage of deployments to production that result
in degraded service, negative customer impact, or require immediate remediation
such as a rollback, hotfix, or patch.
A “failed change” includes any deployment that:
- Is rolled back.
- Requires a hotfix deployed within a short window (commonly 24 hours).
- Triggers a production incident attributed to the change.
- Requires manual intervention to restore service.
This is one of the four DORA key metrics. It measures the stability side of
delivery performance, complementing the throughput metrics of
Lead Time and Release Frequency.
How to Measure
- Count total production deployments over a defined period (weekly, monthly).
- Count deployments classified as failures using the criteria above.
- Divide failures by total deployments and express as a percentage.
Data sources:
- Deployment logs – total deployment count from your CD platform.
- Incident management – incidents linked to specific deployments (PagerDuty,
Opsgenie, ServiceNow).
- Rollback records – deployments that were reverted, either manually or by
automated rollback.
- Hotfix tracking – deployments tagged as hotfixes or emergency changes.
Automate the classification where possible. For example, if a deployment is
followed by another deployment of the same service within a defined window (e.g.,
one hour), flag the original as a potential failure for review.
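A minimal sketch of that flagging heuristic (the one-hour window and the deploy record shape are illustrative; flagged deployments still need human review):

```javascript
// Flag deployments that were followed by another deploy of the same service
// within a short window - candidates for "failed change" classification.
const WINDOW_MS = 60 * 60 * 1000; // 1 hour; tune to your remediation patterns

function flagSuspectDeploys(deploys) {
  // deploys: [{ service, version, at: ISO string }, ...]
  const byService = new Map();
  for (const d of deploys) {
    if (!byService.has(d.service)) byService.set(d.service, []);
    byService.get(d.service).push(d);
  }
  const suspects = [];
  for (const list of byService.values()) {
    list.sort((a, b) => new Date(a.at) - new Date(b.at));
    for (let i = 0; i < list.length - 1; i++) {
      const gap = new Date(list[i + 1].at) - new Date(list[i].at);
      if (gap <= WINDOW_MS) suspects.push(list[i]);
    }
  }
  return suspects; // each one is a rollback, a hotfix target, or a false positive
}

console.log(
  flagSuspectDeploys([
    { service: "checkout", version: "1.2.3", at: "2024-05-01T10:00:00Z" },
    { service: "checkout", version: "1.2.4", at: "2024-05-01T10:25:00Z" },
  ])
); // flags version 1.2.3 for review
```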
Targets
| Level | Change Fail Rate |
|---|---|
| Low | 46 – 60% |
| Medium | 16 – 45% |
| High | 0 – 15% |
| Elite | 0 – 5% |
These levels are drawn from the DORA State of DevOps research. Elite performers
maintain a change fail rate below 5%, meaning fewer than 1 in 20 deployments causes
a problem.
Common Pitfalls
- Not recording failures. Deploying fixes without logging the original failure
understates the true rate. Ensure every incident and rollback is tracked.
- Reclassifying defects. Creating review processes that reclassify production
defects as “feature requests” or “known limitations” hides real failures.
- Inflating deployment count. Re-deploying the same working version to increase
the denominator artificially lowers the rate. Only count deployments that contain
new changes.
- Pursuing zero defects at the cost of speed. An obsessive focus on eliminating
all failures can slow Release Frequency to a crawl. A
small failure rate with fast recovery is preferable to near-zero failures with
monthly deployments.
- Ignoring near-misses. Changes that cause degraded performance but do not
trigger a full incident are still failures. Define clear criteria for what
constitutes a failed change and apply them consistently.
Connection to CD
Change Fail Rate is the primary quality signal in a Continuous Delivery pipeline:
- Validates pipeline quality gates. A rising change fail rate indicates that
the automated tests, security scans, and quality checks in the pipeline are not
catching enough defects. Each failure is an opportunity to add or improve a
quality gate.
- Enables confidence in frequent releases. Teams will only deploy frequently
if they trust the pipeline. A low change fail rate builds this trust and
supports higher Release Frequency.
- Smaller changes fail less. The DORA research consistently shows that smaller,
more frequent deployments have lower failure rates than large, infrequent
releases. Improving Integration Frequency naturally
improves this metric.
- Drives root cause analysis. Each failed change should trigger a blameless
investigation: what automated check could have caught this? The answers feed
directly into pipeline improvements.
- Balances throughput metrics. Change Fail Rate is the essential guardrail for
Lead Time and Release Frequency. If
those metrics improve while change fail rate worsens, the team is trading quality
for speed.
To improve Change Fail Rate:
- Deploy smaller changes more frequently to reduce the blast radius of failures.
- Identify the root cause of each failure and add automated checks to prevent
recurrence.
- Strengthen the test suite, particularly integration and contract tests that
validate interactions between services.
- Implement progressive delivery (canary releases, feature flags) to limit the
impact of defective changes before they reach all users.
- Conduct blameless post-incident reviews and feed learnings back into the
delivery pipeline.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.6.6 - Mean Time to Repair
Average time from when a production incident is detected until service is restored – a DORA key metric for recovery capability.
Definition
Mean Time to Repair (MTTR) measures the average elapsed time between when a
production incident is detected and when it is fully resolved and service is
restored to normal operation.
MTTR reflects an organization’s ability to recover from failure. It encompasses
detection, diagnosis, fix development, build, deployment, and verification. A
short MTTR depends on the entire delivery system working well – fast builds,
automated deployments, good observability, and practiced incident response.
The Accelerate research identifies MTTR as one of the four key DORA metrics and
notes that “software delivery performance is a combination of lead time, release
frequency, and MTTR.” It is the stability counterpart to the throughput metrics.
How to Measure
- Record the detection timestamp. This is when the team first becomes aware of
the incident – typically when an alert fires, a customer reports an issue, or
monitoring detects an anomaly.
- Record the resolution timestamp. This is when the incident is resolved and
service is confirmed to be operating normally. Resolution means the customer
impact has ended, not merely that a fix has been deployed.
- Calculate the duration for each incident.
- Compute the average across all incidents in a given period.
Data sources:
- Incident management platforms – PagerDuty, Opsgenie, ServiceNow, or
Statuspage provide incident lifecycle timestamps.
- Monitoring and alerting – alert trigger times from Datadog, Prometheus
Alertmanager, CloudWatch, or equivalent.
- Deployment logs – timestamps of rollbacks or hotfix deployments.
Report both the mean and the median. The mean can be skewed by a single long
outage, so the median gives a better sense of typical recovery time. Also track
the maximum MTTR per period to highlight worst-case incidents.
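A minimal sketch of that summary, assuming incident records with detection and resolution timestamps (the record shape is illustrative):

```javascript
// Summarize time-to-restore from incident detection/resolution timestamps.
function mttrSummary(incidents) {
  // incidents: [{ detectedAt: ISO string, resolvedAt: ISO string }, ...]
  const hours = incidents
    .map((i) => (new Date(i.resolvedAt) - new Date(i.detectedAt)) / 3_600_000)
    .sort((a, b) => a - b);
  const mean = hours.reduce((sum, h) => sum + h, 0) / hours.length;
  const mid = Math.floor(hours.length / 2);
  const median = hours.length % 2 ? hours[mid] : (hours[mid - 1] + hours[mid]) / 2;
  return { mean, median, max: hours[hours.length - 1] };
}

console.log(
  mttrSummary([
    { detectedAt: "2024-05-01T10:00:00Z", resolvedAt: "2024-05-01T10:40:00Z" },
    { detectedAt: "2024-05-08T02:00:00Z", resolvedAt: "2024-05-08T09:00:00Z" },
  ])
); // the long outage skews the mean; the median stays representative
```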
Targets
| Level | Mean Time to Repair |
|---|---|
| Low | More than 1 week |
| Medium | 1 day – 1 week |
| High | Less than 1 day |
| Elite | Less than 1 hour |
Elite performers restore service in under one hour. This requires automated
rollback or roll-forward capability, fast build pipelines, and well-practiced
incident response processes.
Common Pitfalls
- Closing incidents prematurely. Marking an incident as resolved before the
customer impact has actually ended artificially deflates MTTR. Define “resolved”
clearly and verify that service is truly restored.
- Not counting detection time. If the team discovers a problem informally
(e.g., a developer notices something odd) and fixes it before opening an
incident, the time is not captured. Encourage consistent incident reporting.
- Ignoring recurring incidents. If the same issue keeps reappearing, each
individual MTTR may be short, but the cumulative impact is high. Track recurrence
as a separate quality signal.
- Conflating MTTR with MTTD. Mean Time to Detect (MTTD) and Mean Time to
Repair overlap but are distinct. If you only measure from alert to resolution,
you miss the detection gap – the time between when the problem starts and when
it is detected. Both matter.
- Optimizing MTTR without addressing root causes. Getting faster at fixing
recurring problems is good, but preventing those problems in the first place is
better. Pair MTTR with Change Fail Rate to ensure the
number of incidents is also decreasing.
Connection to CD
MTTR is a direct measure of how well the entire Continuous Delivery system supports
recovery:
- Pipeline speed is the floor. The minimum possible MTTR for a roll-forward
fix is the Build Duration plus deployment time. A 30-minute
build means you cannot restore service via a code fix in less than 30 minutes.
Reducing build duration directly reduces MTTR.
- Automated deployment enables fast recovery. Teams that can deploy with one
click or automatically can roll back or roll forward in minutes. Manual
deployment processes add significant time to every incident.
- Feature flags accelerate mitigation. If a failing change is behind a feature
flag, the team can disable it in seconds without deploying new code. This can
reduce MTTR from minutes to seconds for flag-protected changes.
- Observability shortens detection and diagnosis. Good logging, metrics, and
tracing help the team identify the cause of an incident quickly. Without
observability, diagnosis dominates the repair timeline.
- Practice improves performance. Teams that deploy frequently have more
experience responding to issues. High Release Frequency
correlates with lower MTTR because the team has well-rehearsed recovery
procedures.
- Trunk-based development simplifies rollback. When trunk is always deployable,
the team can roll back to the previous commit. Long-lived branches and complex
merge histories make rollback risky and slow.
To improve MTTR:
- Keep the pipeline always deployable so a fix can be deployed at any time.
- Reduce Build Duration to enable faster roll-forward.
- Implement feature flags for large changes so they can be disabled without
redeployment.
- Invest in observability – structured logging, distributed tracing, and
meaningful alerting.
- Practice incident response regularly, including deploying rollbacks and hotfixes.
- Conduct blameless post-incident reviews and feed learnings back into the pipeline
and monitoring.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.6.7 - Release Frequency
How often changes are deployed to production – a DORA key metric for delivery throughput and team capability.
Definition
Release Frequency (also called Deployment Frequency) measures how often a team
successfully deploys changes to production. It is expressed as deployments per day,
per week, or per month, depending on the team’s current cadence.
This is one of the four DORA key metrics. It measures the throughput side of
delivery performance – how rapidly the team can get completed work into the hands
of users. Higher release frequency enables faster feedback, smaller batch sizes,
and reduced deployment risk.
Each deployment should deliver a meaningful change. Re-deploying the same artifact
or deploying empty changes does not count.
How to Measure
- Count production deployments. Record each successful deployment to the
production environment over a defined period.
- Exclude non-changes. Do not count re-deployments of unchanged artifacts,
infrastructure-only changes (unless relevant), or deployments to non-production
environments.
- Calculate frequency. Divide the count by the time period. Express as
deployments per day (for high performers) or per week/month (for teams earlier
in their journey).
Data sources:
- CD platforms – Argo CD, Spinnaker, Flux, Octopus Deploy, or similar tools
track every deployment.
- CI/CD pipeline logs – GitHub Actions, GitLab CI, Jenkins, and CircleCI
record deployment job executions.
- Cloud provider logs – AWS CodeDeploy, Azure DevOps, GCP Cloud Deploy, and
Kubernetes audit logs.
- Custom deployment scripts – Add a logging line that records the timestamp,
service name, and version to a central log or metrics system.
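A minimal sketch of the custom-script approach, assuming a JSON-lines log that the deploy script appends to (the file name and record shape are illustrative):

```javascript
// Record one JSON line per production deployment, then derive weekly frequency.
const fs = require("node:fs");
const LOG = "deployments.jsonl";

function recordDeployment(service, version) {
  const entry = { at: new Date().toISOString(), service, version };
  fs.appendFileSync(LOG, JSON.stringify(entry) + "\n");
}

function deploysPerWeek(weeks = 4) {
  const since = Date.now() - weeks * 7 * 24 * 60 * 60 * 1000;
  const lines = fs.existsSync(LOG)
    ? fs.readFileSync(LOG, "utf8").trim().split("\n").filter(Boolean)
    : [];
  const recent = lines
    .map((line) => JSON.parse(line))
    .filter((entry) => new Date(entry.at).getTime() >= since);
  return recent.length / weeks;
}

recordDeployment("checkout", "1.2.5"); // call this from your deploy script
console.log(`Deployments per week: ${deploysPerWeek().toFixed(1)}`);
```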
Targets
| Level | Release Frequency |
|---|---|
| Low | Less than once per 6 months |
| Medium | Once per month to once per 6 months |
| High | Once per week to once per month |
| Elite | Multiple times per day |
These levels are drawn from the DORA State of DevOps research. Elite performers
deploy on demand, multiple times per day, with each deployment containing a small
set of changes.
Common Pitfalls
- Counting empty deployments. Re-deploying the same artifact or building
artifacts that contain no changes inflates the metric without delivering value.
Count only deployments with meaningful changes.
- Ignoring failed deployments. If you count deployments that are immediately
rolled back, the frequency looks good but the quality is poor. Pair with
Change Fail Rate to get the full picture.
- Equating frequency with value. Deploying frequently is a means, not an end.
Deploying 10 times a day delivers no value if the changes do not meet user needs.
Release Frequency measures capability, not outcome.
- Batch releasing to hit a target. Combining multiple changes into a single
release to deploy “more often” defeats the purpose. The goal is small, individual
changes flowing through the pipeline independently.
- Focusing on speed without quality. If release frequency increases but
Change Fail Rate also increases, the team is releasing
faster than its quality processes can support. Slow down and improve the pipeline.
Connection to CD
Release Frequency is the ultimate output metric of a Continuous Delivery pipeline:
- Validates the entire delivery system. High release frequency is only possible
when the pipeline is fast, tests are reliable, deployment is automated, and the
team has confidence in the process. It is the end-to-end proof that CD is working.
- Reduces deployment risk. Each deployment carries less change when deployments
are frequent. Less change means less risk, easier rollback, and simpler
debugging when something goes wrong.
- Enables rapid feedback. Frequent releases get features and fixes in front of
users sooner. This shortens the feedback loop and allows the team to course-correct
before investing heavily in the wrong direction.
- Exercises recovery capability. Teams that deploy frequently practice the
deployment process daily. When a production incident occurs, the deployment
process is well-rehearsed and reliable, directly improving
Mean Time to Repair.
- Decouples deploy from release. At high frequency, teams separate the act of
deploying code from the act of enabling features for users. Feature flags,
progressive delivery, and dark launches become standard practice.
To improve Release Frequency:
- Reduce Development Cycle Time by decomposing work
into smaller increments.
- Remove manual handoffs to other teams (e.g., ops, QA, change management).
- Automate every step of the deployment process, from build through production
verification.
- Replace manual change approval boards with automated policy checks and peer
review.
- Convert hard dependencies on other teams or services into soft dependencies using
feature flags and service virtualization.
- Adopt Trunk-Based Development so that
trunk is always in a deployable state.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.6.8 - Work in Progress
Number of work items started but not yet completed – a leading indicator of flow problems, context switching, and delivery delays.
Definition
Work in Progress (WIP) is the total count of work items that have been started but
not yet completed and delivered to production. This includes all types of work:
stories, defects, tasks, spikes, and any other items that a team member has begun
but not finished.
WIP is a leading indicator from Lean manufacturing. Unlike trailing metrics such as
Development Cycle Time or
Lead Time, WIP tells you about problems that are happening right
now. High WIP predicts future delivery delays, increased cycle time, and lower
quality.
Little’s Law provides the mathematical relationship: average cycle time = WIP ÷ throughput.
If throughput (the rate at which items are completed) stays constant, increasing WIP
directly increases cycle time. For example, a team finishing two items per day with ten
items in progress has an average cycle time of five days; cutting WIP to four brings it
down to two days. The only way to reduce cycle time without working faster is to reduce WIP.
How to Measure
- Count all in-progress items. At a regular cadence (daily or at each standup),
count the number of items in any active state on your team’s board. Include
everything between “To Do” and “Done.”
- Normalize by team size. Divide WIP by the number of team members to get a
per-person ratio. This makes the metric comparable across teams of different sizes.
- Track over time. Record the WIP count daily and observe trends. A rising WIP
count is an early warning of delivery problems.
Data sources:
- Kanban boards – Jira, Azure Boards, Trello, GitHub Projects, or physical
boards. Count cards in any column between the backlog and done.
- Issue trackers – Query for items with an “In Progress,” “In Review,”
“In QA,” or equivalent active status.
- Manual count – At standup, ask: “How many things are we actively working on
right now?”
The simplest and most effective approach is to make WIP visible by keeping the team
board up to date and counting active items daily.
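A minimal sketch of that daily count, combined with the Little's Law relationship above (the board statuses, team size, and throughput figure are illustrative):

```javascript
// Daily WIP check: count active items, normalize by team size, and use
// Little's Law to forecast cycle time.
const TEAM_SIZE = 5;

function wipReport(boardItems, throughputPerDay) {
  const active = boardItems.filter(
    (item) => item.status !== "To Do" && item.status !== "Done"
  );
  const wip = active.length;
  return {
    wip,
    wipPerPerson: wip / TEAM_SIZE,
    // Little's Law: average cycle time = WIP / throughput
    forecastCycleTimeDays: wip / throughputPerDay,
  };
}

console.log(
  wipReport(
    [
      { id: "ABC-101", status: "In Progress" },
      { id: "ABC-102", status: "In Review" },
      { id: "ABC-103", status: "To Do" },
    ],
    1.5 // items completed per day, taken from recent history
  )
); // { wip: 2, wipPerPerson: 0.4, forecastCycleTimeDays: 1.33... }
```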
Targets
| Level | WIP per Team |
|---|---|
| Low | More than 2x team size |
| Medium | Between 1x and 2x team size |
| High | Equal to team size |
| Elite | Less than team size (ideally half) |
The guiding principle is that WIP should never exceed team size. A team of five
should have at most five items in progress at any time. Elite teams often work
in pairs, bringing WIP to roughly half the team size.
Common Pitfalls
- Hiding work. Not moving items to “In Progress” when working on them keeps
WIP artificially low. The board must reflect reality. If someone is working on
it, it should be visible.
- Marking items done prematurely. Moving items to “Done” before they are
deployed to production understates WIP. The Definition of Done must include
production deployment.
- Creating micro-tasks. Splitting a single story into many small tasks
(development, testing, code review, deployment) and tracking each separately
inflates the item count without changing the actual work. Measure WIP at the
story or feature level.
- Ignoring unplanned work. Production support, urgent requests, and
interruptions consume capacity but are often not tracked on the board. If the
team is spending time on it, it is WIP and should be visible.
- Setting WIP limits but not enforcing them. WIP limits only work if the team
actually stops starting new work when the limit is reached. Treat WIP limits as
a hard constraint, not a suggestion.
Connection to CD
WIP is the most actionable flow metric and directly impacts every aspect of
Continuous Delivery:
- Predicts cycle time. Per Little’s Law, WIP and cycle time are directly
proportional. Reducing WIP is the fastest way to reduce
Development Cycle Time without changing anything
else about the delivery process.
- Reduces context switching. When developers juggle multiple items, they lose
time switching between contexts. Research consistently shows that each additional
item in progress reduces effective productivity. Low WIP means more focus and
faster completion.
- Exposes blockers. When WIP limits are in place and an item gets blocked, the
team cannot simply start something new. They must resolve the blocker first. This
forces the team to address systemic problems rather than working around them.
- Enables continuous flow. CD depends on a steady flow of small changes moving
through the pipeline. High WIP creates irregular, bursty delivery. Low WIP
creates smooth, predictable flow.
- Improves quality. When teams focus on fewer items, each item gets more
attention. Code reviews happen faster, testing is more thorough, and defects are
caught sooner. This naturally reduces Change Fail Rate.
- Supports trunk-based development. High WIP often correlates with many
long-lived branches. Reducing WIP encourages developers to complete and integrate
work before starting something new, which aligns with
Integration Frequency goals.
To reduce WIP:
- Set explicit WIP limits for the team and enforce them. Start with a limit equal
to team size and reduce it over time.
- Prioritize finishing work over starting new work. At standup, ask “What can I
help finish?” before “What should I start?”
- Prioritize code review and pairing to unblock teammates over picking up new items.
- Make the board visible and accurate. Use it as the single source of truth for
what the team is working on.
- Identify and address recurring blockers that cause items to stall in progress.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.7 - Testing
Testing types, patterns, and best practices for building confidence in your delivery pipeline.
A reliable test suite is essential for continuous delivery. These pages cover the
different types of tests, when to use each, and best practices for test architecture.
Test Types
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.7.1 - Unit Tests
Fast, deterministic tests that verify individual functions, methods, or components in isolation with test doubles for dependencies.
Definition
A unit test is a deterministic test that exercises a discrete unit of the application – such as
a function, method, or UI component – in isolation to determine whether it behaves as expected.
All external dependencies are replaced with test doubles so the test runs
quickly and produces the same result every time.
When testing the behavior of functions, prefer testing public APIs (methods, interfaces,
exported functions) over private internals. Testing private implementation details creates
change-detector tests that break during routine refactoring without adding safety.
The purpose of unit tests is to:
- Verify the functionality of a single unit (method, class, function) in isolation.
- Cover high-complexity logic where many input permutations exist, such as business rules, calculations, and state transitions.
- Keep cyclomatic complexity visible and manageable through good separation of concerns.
When to Use
- During development – run the relevant subset of unit tests continuously while writing
code. TDD (Red-Green-Refactor) is the most effective workflow.
- On every commit – use pre-commit hooks or watch-mode test runners so broken tests never
reach the remote repository.
- In CI – execute the full unit test suite on every pull request and on the trunk after
merge to verify nothing was missed locally.
Unit tests are the right choice when the behavior under test can be exercised without network
access, file system access, or database connections. If you need any of those, you likely need
an integration test or a functional test instead.
Characteristics
| Property | Value |
| --- | --- |
| Speed | Milliseconds per test |
| Determinism | Always deterministic |
| Scope | Single function, method, or component |
| Dependencies | All replaced with test doubles |
| Network | None |
| Database | None |
| Breaks build | Yes |
Examples
A JavaScript unit test verifying a pure utility function:
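A minimal sketch using Jest; the `slugify` utility and its module path are illustrative stand-ins, not from the original:

```javascript
// slugify.test.js - unit test for a pure utility function (hypothetical example)
const { slugify } = require('./slugify');

describe('slugify', () => {
  it('lowercases and hyphenates words', () => {
    expect(slugify('Continuous Delivery')).toBe('continuous-delivery');
  });

  it('drops characters that are not alphanumeric', () => {
    expect(slugify('Ship it! (today)')).toBe('ship-it-today');
  });

  it('returns an empty string for empty input', () => {
    expect(slugify('')).toBe('');
  });
});
```

No test doubles are needed here because the function is pure: no I/O, no shared state, and the same output for the same input.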
A Java unit test using Mockito to isolate the system under test:
Anti-Patterns
- Testing private methods – private implementations are meant to change. Test the public
interface that calls them instead.
- No assertions – a test that runs code without asserting anything provides false
confidence. Lint rules like
jest/expect-expect can catch this.
- Disabling or skipping tests – skipped tests erode confidence over time. Fix or remove
them.
- Testing implementation details – asserting on internal state or call order rather than
observable output creates brittle tests that break during refactoring.
- Ice cream cone testing – relying primarily on slow E2E tests while neglecting fast unit
tests inverts the test pyramid and slows feedback.
- Chasing coverage numbers – gaming coverage metrics (e.g., running code paths without
meaningful assertions) creates a false sense of confidence. Focus on use-case coverage
instead.
Connection to CD Pipeline
Unit tests occupy the base of the test pyramid. They run in the earliest stages of the
CI/CD pipeline and provide the fastest feedback loop:
- Local development – watch mode reruns tests on every save.
- Pre-commit – hooks run the suite before code reaches version control.
- PR verification – CI runs the full suite and blocks merge on failure.
- Trunk verification – CI reruns tests on the merged HEAD to catch integration issues.
Because unit tests are fast and deterministic, they should always break the build on failure.
A healthy CD pipeline depends on a large, reliable unit test suite that gives developers
confidence to ship small changes frequently.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.7.2 - Integration Tests
Deterministic tests that verify how units interact together or with external system boundaries using test doubles for non-deterministic dependencies.
Definition
An integration test is a deterministic test that verifies how the unit under test interacts
with other units without directly accessing external sub-systems. It may validate multiple
units working together (sometimes called a “sociable unit test”) or the portion of the code
that interfaces with an external network dependency while using a test double to represent
that dependency.
For clarity: an “integration test” is not a test that broadly integrates multiple
sub-systems. That is an end-to-end test.
When to Use
Integration tests provide the best balance of speed, confidence, and cost. Use them when:
- You need to verify that multiple units collaborate correctly – for example, a service
calling a repository that calls a data mapper.
- You need to validate the interface layer to an external system (HTTP client, message
producer, database query) while keeping the external system replaced by a test double.
- You want to confirm that a refactoring did not break behavior. Integration tests that
avoid testing implementation details survive refactors without modification.
- You are building a front-end component that composes child components and needs to verify
the assembled behavior from the user’s perspective.
If the test requires a live network call to a system outside localhost, it is either a
contract test or an E2E test.
Characteristics
| Property | Value |
| --- | --- |
| Speed | Milliseconds to low seconds |
| Determinism | Always deterministic |
| Scope | Multiple units or a unit plus its boundary |
| Dependencies | External systems replaced with test doubles |
| Network | Localhost only |
| Database | Localhost / in-memory only |
| Breaks build | Yes |
Examples
A JavaScript integration test verifying that a connector returns structured data:
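A minimal sketch, assuming a hypothetical `UserConnector` that receives its HTTP client as a constructor argument so the real client can be swapped for a Jest test double:

```javascript
// userConnector.test.js - integration test with the HTTP boundary replaced (hypothetical example)
const { UserConnector } = require('./userConnector');

describe('UserConnector', () => {
  it('maps the raw API payload into structured user data', async () => {
    // Test double standing in for the real HTTP client; no network access.
    const fakeHttpClient = {
      get: jest.fn().mockResolvedValue({
        status: 200,
        data: { id: '42', first_name: 'Ada', last_name: 'Lovelace' },
      }),
    };

    const connector = new UserConnector(fakeHttpClient);
    const user = await connector.fetchUser('42');

    expect(fakeHttpClient.get).toHaveBeenCalledWith('/users/42');
    expect(user).toEqual({ id: '42', fullName: 'Ada Lovelace' });
  });
});
```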
Subcategories
Service integration tests – Validate how the system under test responds to information
from an external service. Use virtual services or static mocks; pair with
contract tests to keep the doubles current.
Database integration tests – Validate query logic against a controlled data store. Prefer
in-memory databases, isolated DB instances, or personalized datasets over shared live data.
Front-end integration tests – Render the component tree and interact with it the way a
user would. Follow the accessibility order of operations for element selection: visible text
and labels first, ARIA roles second, test IDs only as a last resort.
Anti-Patterns
- Peeking behind the curtain – using tools that expose component internals (e.g.,
Enzyme’s
instance() or state()) instead of testing from the user’s perspective.
- Mocking too aggressively – replacing every collaborator turns an integration test into a
unit test and removes the value of testing real interactions. Only mock what is necessary to
maintain determinism.
- Testing implementation details – asserting on internal state, private methods, or call
counts rather than observable output.
- Introducing a test user – creating an artificial actor that would never exist in
production. Write tests from the perspective of a real end-user or API consumer.
- Tolerating flaky tests – non-deterministic integration tests erode trust. Fix or remove
them immediately.
- Duplicating E2E scope – if the test integrates multiple deployed sub-systems with live
network calls, it belongs in the E2E category, not here.
Connection to CD Pipeline
Integration tests form the largest portion of a healthy test suite (the “trophy” or the
middle of the pyramid). They run alongside unit tests in the earliest CI stages:
- Local development – run in watch mode or before committing.
- PR verification – CI executes the full suite; failures block merge.
- Trunk verification – CI reruns on the merged HEAD.
Because they are deterministic and fast, integration tests should always break the build.
A team whose refactors break many tests likely has too few integration tests and too many
fine-grained unit tests. As Kent C. Dodds advises: “Write tests, not too many, mostly
integration.”
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.7.3 - Functional Tests
Deterministic tests that verify all modules of a sub-system work together from the actor’s perspective, using test doubles for external dependencies.
Definition
A functional test is a deterministic test that verifies all modules of a sub-system are
working together. It introduces an actor – typically a user interacting with the UI or a
consumer calling an API – and validates the ingress and egress of that actor within the
system boundary. External sub-systems are replaced with test doubles to
keep the test deterministic.
Functional tests cover broad-spectrum behavior: UI interactions, presentation logic, and
business logic flowing through the full sub-system. They differ from
end-to-end tests in that side effects are mocked and never cross boundaries
outside the system’s control.
Functional tests are sometimes called component tests.
When to Use
- You need to verify a complete user-facing feature from input to output within a single
deployable unit (e.g., a service or a front-end application).
- You want to test how the UI, business logic, and data layers interact without depending
on live external services.
- You need to simulate realistic user workflows – filling in forms, navigating pages,
submitting API requests – while keeping the test fast and repeatable.
- You are validating acceptance criteria for a user story and want a test that maps
directly to the specified behavior.
If the test needs to reach a live external dependency, it is an E2E test. If it
tests a single unit in isolation, it is a unit test.
Characteristics
| Property | Value |
| --- | --- |
| Speed | Seconds (slower than unit, faster than E2E) |
| Determinism | Always deterministic |
| Scope | All modules within a single sub-system |
| Dependencies | External systems replaced with test doubles |
| Network | Localhost only |
| Database | Localhost / in-memory only |
| Breaks build | Yes |
Examples
A functional test for a REST API using an in-process server and mocked downstream services:
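A sketch using supertest against an in-process Express app, assuming a hypothetical `createApp` factory that accepts its downstream payment gateway as a dependency:

```javascript
// orders.functional.test.js - exercises the HTTP API in-process (hypothetical example)
const request = require('supertest');
const { createApp } = require('../src/app'); // hypothetical app factory

describe('POST /orders', () => {
  it('creates an order and returns its id', async () => {
    // The external payment gateway is replaced with a test double.
    const fakePaymentGateway = {
      charge: jest.fn().mockResolvedValue({ approved: true, reference: 'ref-123' }),
    };
    const app = createApp({ paymentGateway: fakePaymentGateway });

    const response = await request(app)
      .post('/orders')
      .send({ sku: 'book-1', quantity: 2 })
      .expect(201);

    expect(response.body.orderId).toBeDefined();
    expect(fakePaymentGateway.charge).toHaveBeenCalledTimes(1);
  });
});
```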
A front-end functional test exercising a login flow with a mocked auth service:
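A sketch assuming a React `LoginPage` component, React Testing Library, and a Jest module mock for the auth client; all names and paths are illustrative:

```javascript
// loginPage.functional.test.jsx - drives the login flow the way a user would (hypothetical example)
import React from 'react';
import '@testing-library/jest-dom';
import { render, screen } from '@testing-library/react';
import userEvent from '@testing-library/user-event';
import { LoginPage } from '../src/LoginPage';
import { login } from '../src/authClient';

// Replace the real auth service call with a test double.
jest.mock('../src/authClient');

test('shows a welcome message after a successful login', async () => {
  login.mockResolvedValue({ token: 'fake-token', name: 'Ada' });
  const user = userEvent.setup();

  render(<LoginPage />);

  await user.type(screen.getByLabelText(/email/i), 'ada@example.com');
  await user.type(screen.getByLabelText(/password/i), 'correct-horse');
  await user.click(screen.getByRole('button', { name: /sign in/i }));

  expect(await screen.findByText(/welcome, ada/i)).toBeInTheDocument();
  expect(login).toHaveBeenCalledWith('ada@example.com', 'correct-horse');
});
```

Element selection follows the accessibility-first order described under integration tests: visible labels and roles first, test IDs only as a last resort.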
Anti-Patterns
- Using live external services – this makes the test non-deterministic and slow. Use test
doubles for anything outside the sub-system boundary.
- Testing through the database – sharing a live database between tests introduces ordering
dependencies and flakiness. Use in-memory databases or mocked data layers.
- Ignoring the actor’s perspective – functional tests should interact with the system the
way a user or consumer would. Reaching into internal APIs or bypassing the UI defeats the
purpose.
- Duplicating unit test coverage – functional tests should focus on feature-level behavior
and happy/critical paths, not every edge case. Leave permutation testing to unit tests.
- Slow test setup – if spinning up the sub-system takes too long, invest in faster
bootstrapping (in-memory stores, lazy initialization) rather than skipping functional tests.
Connection to CD Pipeline
Functional tests run after unit and integration tests in the pipeline, typically as part of
the same CI stage:
- PR verification – functional tests run against the sub-system in isolation, giving
confidence that the feature works before merge.
- Trunk verification – the same tests run on the merged HEAD to catch conflicts.
- Pre-deployment gate – functional tests can serve as the final deterministic gate before
a build artifact is promoted to a staging environment.
Because functional tests are deterministic, they should break the build on failure.
They are more expensive than unit and integration tests, so teams should focus on
happy-path and critical-path scenarios while keeping the total count manageable.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.7.4 - End-to-End Tests
Non-deterministic tests that validate the entire software system along with its integration with external interfaces and production-like scenarios.
Definition
End-to-end (E2E) tests validate the entire software system, including its integration with
external interfaces. They exercise complete production-like scenarios using real (or
production-like) data and environments to simulate real-time settings. No test doubles are
used – the test hits live services, databases, and third-party integrations just as a real
user would.
Because they depend on external systems, E2E tests are typically non-deterministic: they
can fail for reasons unrelated to code correctness, such as network instability or
third-party outages.
When to Use
E2E tests should be the least-used test type due to their high cost in execution time and
maintenance. Use them for:
- Happy-path validation of critical business flows (e.g., user signup, checkout, payment
processing).
- Smoke testing a deployed environment to verify that key integrations are functioning.
- Cross-team workflows that span multiple sub-systems and cannot be tested any other way.
Do not use E2E tests to cover edge cases, error handling, or input validation – those
scenarios belong in unit, integration, or
functional tests.
Vertical vs. Horizontal E2E Tests
Vertical E2E tests target features under the control of a single team:
- Favoriting an item and verifying it persists across refresh.
- Creating a saved list and adding items to it.
Horizontal E2E tests span multiple teams:
- Navigating from the homepage through search, item detail, cart, and checkout.
Horizontal tests are significantly more complex and fragile. Due to their large failure
surface area, they are not suitable for blocking release pipelines.
Characteristics
| Property | Value |
| --- | --- |
| Speed | Seconds to minutes per test |
| Determinism | Typically non-deterministic |
| Scope | Full system including external integrations |
| Dependencies | Real services, databases, third-party APIs |
| Network | Full network access |
| Database | Live databases |
| Breaks build | Generally no (see guidance below) |
Examples
A vertical E2E test verifying user lookup through a live web interface:
A browser-based E2E test using a tool like Playwright:
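A minimal Playwright sketch that doubles as the user-lookup case above; the staging URL, route, labels, and seeded test user are placeholders:

```javascript
// user-lookup.e2e.spec.js - runs against a deployed environment, no test doubles (hypothetical example)
const { test, expect } = require('@playwright/test');

test('an agent can look up an existing user by email', async ({ page }) => {
  // Placeholder staging URL and seeded test data.
  await page.goto('https://staging.example.com/admin/users');

  await page.getByLabel('Search by email').fill('seeded.user@example.com');
  await page.getByRole('button', { name: 'Search' }).click();

  // This assertion depends on live services and seeded data, so the test is
  // non-deterministic and should not gate the build on its own.
  await expect(page.getByRole('row', { name: /seeded\.user@example\.com/ })).toBeVisible();
});
```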
Anti-Patterns
- Using E2E tests as the primary safety net – this is the “ice cream cone” anti-pattern.
E2E tests are slow and fragile; the majority of your confidence should come from unit and
integration tests.
- Blocking the pipeline with horizontal E2E tests – these tests span too many teams and
failure surfaces. Run them asynchronously and review failures out of band.
- Ignoring flaky failures – E2E tests often fail for environmental reasons. Track the
frequency and root cause of failures. If a test is not providing signal, fix it or remove
it.
- Testing edge cases in E2E – exhaustive input validation and error-path testing should
happen in cheaper, faster test types.
- Not capturing failure context – E2E failures are expensive to debug. Capture
screenshots, network logs, and video recordings automatically on failure.
Connection to CD Pipeline
E2E tests run in the later stages of the delivery pipeline, after the build artifact has
passed all deterministic tests and has been deployed to a staging or pre-production
environment:
- Post-deployment smoke tests – a small, fast suite of vertical E2E tests verifies that
the deployment succeeded and critical paths work.
- Scheduled regression suites – broader E2E suites (including horizontal tests) run on a
schedule rather than on every commit.
- Production monitoring – customer experience alarms (synthetic monitoring) are a form of
continuous E2E testing that runs in production.
Because E2E tests are non-deterministic, they should not break the build in most cases. A
team may choose to gate on a small set of highly reliable vertical E2E tests, but must invest
in reducing false positives to make this valuable. CD pipelines should be optimized for rapid
recovery of production issues rather than attempting to prevent all defects with slow,
fragile E2E gates.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.7.5 - Contract Tests
Non-deterministic tests that validate test doubles by verifying API contract format against live external systems.
Definition
A contract test validates that the test doubles used in
integration tests still accurately represent the real external system.
Contract tests run against the live external sub-system and exercise the portion of the
code that interfaces with it. Because they depend on live services, contract tests are
non-deterministic and should not break the build. Instead, failures should trigger a
review to determine whether the contract has changed and the test doubles need updating.
A contract test validates contract format, not specific data. It verifies that response
structures, field names, types, and status codes match expectations – not that particular
values are returned.
Contract tests have two perspectives:
- Provider – the team that owns the API verifies that all changes are backwards compatible
(unless a new API version is introduced). Every build should validate the provider contract.
- Consumer – the team that depends on the API verifies that they can still consume the
properties they need, following
Postel’s Law: “Be conservative in
what you do, be liberal in what you accept from others.”
When to Use
- You have integration tests that use test doubles (mocks, stubs, recorded
responses) to represent external services, and you need assurance those doubles remain
accurate.
- You consume a third-party or cross-team API that may change without notice.
- You provide an API to other teams and want to ensure that your changes do not break their
expectations (consumer-driven contracts).
- You are adopting contract-driven development, where contracts are defined during design
so that provider and consumer teams can work in parallel using shared mocks and fakes.
Characteristics
| Property | Value |
| --- | --- |
| Speed | Seconds (depends on network latency) |
| Determinism | Non-deterministic (hits live services) |
| Scope | Interface boundary between two systems |
| Dependencies | Live external sub-system |
| Network | Yes – calls the real dependency |
| Database | Depends on the provider |
| Breaks build | No – failures trigger review, not build failure |
Examples
A provider contract test verifying that an API response matches the expected schema:
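A sketch of the provider-side check using Jest and the built-in fetch (Node 18+); it asserts on field presence, types, and status code, never on specific values, and the URL is a placeholder:

```javascript
// users-contract.test.js - validates response shape, not data values (hypothetical example)
describe('GET /users/:id contract', () => {
  it('returns the agreed structure', async () => {
    // Live call to the running provider; failures trigger review rather than
    // breaking the build.
    const response = await fetch('https://staging.example.com/api/users/1');
    const body = await response.json();

    expect(response.status).toBe(200);
    expect(typeof body.id).toBe('string');
    expect(typeof body.name).toBe('string');
    expect(typeof body.email).toBe('string');
    // New optional fields are fine; missing or re-typed fields are a contract break.
  });
});
```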
A consumer-driven contract test using Pact:
Anti-Patterns
- Using contract tests to validate business logic – contract tests verify structure and
format, not behavior. Business logic belongs in functional tests.
- Breaking the build on contract test failure – because these tests hit live systems,
failures may be caused by network issues or temporary outages, not actual contract changes.
Treat failures as signals to investigate.
- Neglecting to update test doubles – when a contract test fails because the upstream API
changed, the test doubles in your integration tests must be updated to match. Ignoring
failures defeats the purpose.
- Running contract tests too infrequently – the frequency should be proportional to the
volatility of the interface. Highly active APIs need more frequent contract validation.
- Testing specific data values – asserting that
name equals "Alice" makes the test
brittle. Assert on types, required fields, and response codes instead.
Connection to CD Pipeline
Contract tests run asynchronously from the main CI build, typically on a schedule:
- Provider side – provider contract tests (schema validation, response code checks) are
often implemented as deterministic unit tests and run on every commit as part of the
provider’s CI pipeline.
- Consumer side – consumer contract tests run on a schedule (e.g., hourly or daily)
against the live provider. Failures are reviewed and may trigger updates to test doubles
or conversations between teams.
- Consumer-driven contracts – when using tools like Pact, the consumer publishes
contract expectations and the provider runs them continuously. Both teams communicate when
contracts break.
Contract tests are the bridge that keeps your fast, deterministic integration test suite
honest. Without them, test doubles can silently drift from reality, and your integration
tests provide false confidence.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.7.6 - Static Analysis
Code analysis tools that evaluate non-running code for security vulnerabilities, complexity, and best practice violations.
Definition
Static analysis (also called static testing) evaluates non-running code against rules for
known good practices. Unlike other test types that execute code and observe behavior, static
analysis inspects source code, configuration files, and dependency manifests to detect
problems before the code ever runs.
Static analysis serves several key purposes:
- Catches errors that would otherwise surface at runtime.
- Warns of excessive complexity that degrades the ability to change code safely.
- Identifies security vulnerabilities and coding patterns that provide attack vectors.
- Enforces coding standards by removing subjective style debates from code reviews.
- Alerts to dependency issues – outdated packages, known CVEs, license incompatibilities,
or supply-chain compromises.
When to Use
Static analysis should run continuously, at every stage where feedback is possible:
- In the IDE – real-time feedback as developers type, via editor plugins and language
server integrations.
- On save – format-on-save and lint-on-save catch issues immediately.
- Pre-commit – hooks prevent problematic code from entering version control.
- In CI – the full suite of static checks runs on every PR and on the trunk after merge,
verifying that earlier local checks were not bypassed.
Static analysis is always applicable. Every project, regardless of language or platform,
benefits from linting, formatting, and dependency scanning.
Characteristics
| Property | Value |
| --- | --- |
| Speed | Seconds (typically the fastest test category) |
| Determinism | Always deterministic |
| Scope | Entire codebase (source, config, dependencies) |
| Dependencies | None (analyzes code at rest) |
| Network | None (except dependency scanners) |
| Database | None |
| Breaks build | Yes |
Examples
Linting
A .eslintrc.json configuration enforcing test quality rules:
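One possible configuration, assuming eslint-plugin-jest is installed; the rule names are real, the severities and thresholds are a matter of team preference:

```json
{
  "extends": ["eslint:recommended", "plugin:jest/recommended"],
  "plugins": ["jest"],
  "rules": {
    "jest/expect-expect": "error",
    "jest/no-disabled-tests": "error",
    "jest/no-focused-tests": "error",
    "complexity": ["error", 10],
    "max-depth": ["error", 3]
  }
}
```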
Type Checking
TypeScript catches type mismatches at compile time, eliminating entire classes of runtime
errors:
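A small illustration with a hypothetical pricing function; `tsc` rejects the second call before any code runs:

```typescript
// pricing.ts - the compiler catches the mismatch without executing anything
function applyDiscount(price: number, percent: number): number {
  return price - price * (percent / 100);
}

applyDiscount(100, 10);    // OK
applyDiscount(100, '10%'); // Compile error: Argument of type 'string' is not
                           // assignable to parameter of type 'number'.
```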
Dependency Scanning
Tools like npm audit, Snyk, or Dependabot scan for known vulnerabilities:
Types of Static Analysis
| Type | Purpose |
| --- | --- |
| Linting | Catches common errors and enforces best practices |
| Formatting | Enforces consistent code style, removing subjective debates |
| Complexity analysis | Flags overly deep or long code blocks that breed defects |
| Type checking | Prevents type-related bugs, replacing some unit tests |
| Security scanning | Detects known vulnerabilities and dangerous coding patterns |
| Dependency scanning | Checks for outdated, hijacked, or insecurely licensed deps |
Anti-Patterns
- Disabling rules instead of fixing code – suppressing linter warnings or ignoring
security findings erodes the value of static analysis over time.
- Not customizing rules – default rulesets are a starting point. Write custom rules for
patterns that come up repeatedly in code reviews.
- Running static analysis only in CI – by the time CI reports a formatting error, the
developer has context-switched. IDE plugins and pre-commit hooks provide immediate feedback.
- Ignoring dependency vulnerabilities – known CVEs in dependencies are a direct attack
vector. Treat high-severity findings as build-breaking.
- Treating static analysis as optional – static checks should be mandatory and enforced.
If developers can bypass them, they will.
Connection to CD Pipeline
Static analysis is the first gate in the CD pipeline, providing the fastest feedback:
- IDE / local development – plugins run in real time as code is written.
- Pre-commit – hooks run linters and formatters, blocking commits that violate rules.
- PR verification – CI runs the full static analysis suite (linting, type checking,
security scanning, dependency auditing) and blocks merge on failure.
- Trunk verification – the same checks re-run on the merged HEAD to catch anything
missed.
- Scheduled scans – dependency and security scanners run on a schedule to catch newly
disclosed vulnerabilities in existing dependencies.
Because static analysis requires no running code, no test environment, and no external
dependencies, it is the cheapest and fastest form of quality verification. A mature CD
pipeline treats static analysis failures the same as test failures: they break the build.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.7.7 - Test Doubles
Patterns for isolating dependencies in tests: stubs, mocks, fakes, spies, and dummies.
Definition
Test doubles are stand-in objects that replace real production dependencies during testing.
The term comes from the film industry’s “stunt double” – just as a stunt double replaces an
actor for dangerous scenes, a test double replaces a costly or non-deterministic dependency
to make tests fast, isolated, and reliable.
Test doubles allow you to:
- Remove non-determinism by replacing network calls, databases, and file systems with
predictable substitutes.
- Control test conditions by forcing specific states, error conditions, or edge cases that
would be difficult to reproduce with real dependencies.
- Increase speed by eliminating slow I/O operations.
- Isolate the system under test so that failures point directly to the code being tested,
not to an external dependency.
Types of Test Doubles
| Type | Description | Example Use Case |
| --- | --- | --- |
| Dummy | Passed around but never actually used. Fills parameter lists. | A required logger parameter in a constructor. |
| Stub | Provides canned answers to calls made during the test. Does not respond to anything outside what is programmed. | Returning a fixed user object from a repository. |
| Spy | A stub that also records information about how it was called (arguments, call count, order). | Verifying that an analytics event was sent once. |
| Mock | Pre-programmed with expectations about which calls will be made. Verification happens on the mock itself. | Asserting that sendEmail() was called with specific arguments. |
| Fake | Has a working implementation, but takes shortcuts not suitable for production. | An in-memory database replacing PostgreSQL. |
Choosing the Right Double
- Use stubs when you need to supply data but do not care how it was requested.
- Use spies when you need to verify call arguments or call count.
- Use mocks when the interaction itself is the primary thing being verified.
- Use fakes when you need realistic behavior but cannot use the real system.
- Use dummies when a parameter is required by the interface but irrelevant to the test.
When to Use
Test doubles are used in every layer of deterministic testing:
- Unit tests – nearly all dependencies are replaced with test doubles to
achieve full isolation.
- Integration tests – external sub-systems (APIs, databases, message
queues) are replaced, but internal collaborators remain real.
- Functional tests – dependencies that cross the sub-system boundary
are replaced to maintain determinism.
Test doubles should be used less in later pipeline stages.
End-to-end tests use no test doubles by design.
Examples
A JavaScript stub providing a canned response:
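A minimal sketch, assuming a hypothetical `getWeatherSummary` function that takes its weather API client as an argument:

```javascript
// weather.test.js - a stub supplies canned data, no network involved (hypothetical example)
const { getWeatherSummary } = require('./weather'); // hypothetical function under test

test('summarizes the forecast for display', async () => {
  // Stub: canned answer only; it does not react to anything beyond what is programmed.
  const stubWeatherClient = {
    fetchForecast: async () => ({ tempC: 21, condition: 'sunny' }),
  };

  const summary = await getWeatherSummary(stubWeatherClient, 'Amsterdam');

  expect(summary).toBe('Sunny, 21°C');
});
```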
A Java spy verifying interaction:
A fake in-memory repository:
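A sketch of a fake backed by a Map; the repository interface and the `RegistrationService` it supports are hypothetical:

```javascript
// inMemoryUserRepository.test.js - a fake has a real, if non-production, implementation (hypothetical example)
const { RegistrationService } = require('./registrationService'); // hypothetical system under test

class InMemoryUserRepository {
  constructor() {
    this.users = new Map();
  }

  async save(user) {
    this.users.set(user.id, { ...user });
    return user;
  }

  async findById(id) {
    return this.users.get(id) ?? null;
  }
}

test('registering a user makes them retrievable', async () => {
  const repository = new InMemoryUserRepository();
  const service = new RegistrationService(repository);

  await service.register({ id: 'u1', email: 'ada@example.com' });

  expect(await repository.findById('u1')).toMatchObject({ email: 'ada@example.com' });
});
```

Unlike a stub, the fake honors the same read-after-write behavior the real repository would, which makes it useful when a test needs realistic state without a live database.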
Anti-Patterns
- Mocking what you do not own – wrapping a third-party API in a thin adapter and mocking
the adapter is safer than mocking the third-party API directly. Direct mocks couple your
tests to the library’s implementation.
- Over-mocking – replacing every collaborator with a mock turns the test into a mirror of
the implementation. Tests become brittle and break on every refactor. Only mock what is
necessary to maintain determinism.
- Not validating test doubles – if the real dependency changes its contract, your test
doubles silently drift. Use contract tests to keep doubles honest.
- Complex mock setup – if setting up mocks requires dozens of lines, the system under test
may have too many dependencies. Consider refactoring the production code rather than adding
more mocks.
- Using mocks to test implementation details – asserting on the exact sequence and count
of internal method calls creates change-detector tests. Prefer asserting on observable
output.
Connection to CD Pipeline
Test doubles are a foundational technique that enables the fast, deterministic tests required
for continuous delivery:
- Early pipeline stages (static analysis, unit tests, integration tests) rely heavily on
test doubles to stay fast and deterministic. This is where the majority of defects are
caught.
- Later pipeline stages (E2E tests, production monitoring) use fewer or no test doubles,
trading speed for realism.
- Contract tests run asynchronously to validate that test doubles still match reality,
closing the gap between the deterministic and non-deterministic stages of the pipeline.
The guiding principle from Justin Searls applies: “Don’t poke too many holes in reality.”
Use test doubles when you must, but prefer real implementations when they are fast and
deterministic.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.