Your Migration Journey
A learning path for migrating to continuous delivery, built on years of hands-on experience helping teams remove friction and improve delivery outcomes.
This guide is a learning path built on years of helping teams across industries remove
friction, improve delivery outcomes, and raise team morale through continuous delivery.
It expands on the practices defined at MinimumCD.org and the
production-tested playbooks from the Dojo Consortium,
grounded in hands-on application of one driving question: “Why can’t I deliver today’s
work to production today?” Start with the problem your team feels most, then follow the
path to solving it.
Where to Start
- Anti-Patterns - Find the problems your team is facing and learn the concrete steps to fix each one.
- Brownfield CD - Already have a running system? A phased approach to migrating existing applications and teams to continuous delivery.
Content Sources
This guide adapts content from two CC BY 4.0 licensed sources: MinimumCD.org and the Dojo Consortium.
Each adapted page includes attribution to its source material.
1 - Quality and Delivery Anti-Patterns
Start here. Find the anti-patterns your team is facing and learn the path to solving them.
Every team migrating to continuous delivery faces obstacles. Most are not unique to your team,
your technology, or your industry. This section catalogs the anti-patterns that hurt quality,
increase rework, and make delivery timelines unpredictable - then provides a concrete path to
fix each one.
Start with the problem you feel most. Each page links to the practices and migration phases
that address it.
Anti-pattern index
Sorted by quality impact so you can prioritize what to fix first.
1.1 - Team Workflow
Anti-patterns in how teams assign, coordinate, and manage the flow of work.
These anti-patterns affect how work moves through the team. They create bottlenecks, hide
problems, and prevent the steady flow of small changes that continuous delivery requires.
1.1.1 - Pull Request Review Bottlenecks
Pull requests sit for days waiting for review. Reviews happen in large batches. Authors have moved on by the time feedback arrives.
Category: Team Workflow | Quality Impact: High
What This Looks Like
A developer opens a pull request and waits. Hours pass. A day passes. They ping someone in chat.
The reviewer is busy with their own work. Eventually, late in the afternoon or the next morning,
comments arrive. The author has moved on to something else and has to reload context to respond.
Another round of comments. Another wait. The PR finally merges two or three days after it was
opened.
Common variations:
- The aging PR queue. The team has five or more open PRs at any given time. Some are days old.
Developers start new work while they wait, which creates more PRs, which creates more review
load, which slows reviews further.
- The designated reviewer. One or two senior developers review everything. They are
overwhelmed. Their review queue is a bottleneck that the rest of the team works around by
starting more work while they wait.
- The drive-by approval. Reviews are so slow that the team starts rubber-stamping PRs to
unblock each other. The review step exists in name only. Quality drops, but at least things
merge.
- The nitpick spiral. Reviewers leave dozens of style comments on formatting, naming, and
conventions that could be caught by a linter. Each round triggers another round. A 50-line
change accumulates 30 comments across three review cycles.
- The “I’ll get to it” pattern. When asked about a pending review, the answer is always “I’ll
look at it after I finish this.” But they never finish “this” because they have their own work,
and reviewing someone else’s code is never the priority.
The telltale sign: the team tracks PR age and the average is measured in days, not hours.
Why This Is a Problem
Slow code review is not just an inconvenience. It is a systemic bottleneck that undermines
continuous integration, inflates cycle time, and degrades the quality it is supposed to protect.
It blocks continuous integration
Trunk-based development requires integrating to trunk at least once per day. A PR that sits for
two days makes daily integration impossible. The branch diverges from trunk while it waits. Other
developers make changes to the same files. By the time the review is done, the PR has merge
conflicts that require additional work to resolve.
This is a compounding problem. Slow reviews cause longer-lived branches. Longer-lived branches
cause larger merge conflicts. Larger merge conflicts make integration painful. Painful integration
makes the team dread merging, which makes them delay opening PRs until the work is “complete,”
which makes PRs larger, which makes reviews take longer.
In teams where reviews complete within hours, branches rarely live longer than a day. Merge
conflicts are rare because changes are small and trunk has not moved far since the branch was
created.
It inflates cycle time
Every hour a PR waits for review is an hour added to cycle time. For a story that takes four hours
to code, a two-day review wait means the review step dominates the total cycle time. The coding
was fast. The pipeline is fast. But the work sits idle for days because a human has not looked at
it yet.
This wait time is pure waste. Nothing is happening to the code while it waits. No value is being
delivered. The change is done but not integrated, tested in the full pipeline, or deployed. It is
inventory sitting on the shelf.
When reviews happen within two hours, the review step nearly disappears from the cycle time
measurement. Code flows from development to trunk to production with minimal idle time.
It degrades the review quality it is supposed to protect
Slow reviews produce worse reviews, not better ones. When a reviewer sits down to review a PR that
was opened two days ago, they have no context on the author’s thinking. They review the code cold,
missing the intent behind decisions. They leave comments that the author already considered and
rejected, triggering unnecessary back-and-forth.
Large PRs make this worse. When a review has been delayed, the author often keeps working on the
same branch, adding more changes to avoid opening a second PR while the first one waits. What
started as a 50-line change becomes a 300-line change. Research consistently shows that reviewer
effectiveness drops sharply after 200 lines. Large PRs get superficial reviews - the reviewer
skims the diff, leaves a few surface-level comments, and approves because they do not have time
to review it thoroughly.
Fast reviews are better reviews. A reviewer who looks at a 50-line change within an hour of it
being opened has full context on what the team is working on, can ask the author questions in real
time, and can give focused attention to a small, digestible change.
It creates hidden WIP
Every open PR is work in progress. The code is written but not integrated. The developer who
authored it has moved on to something new, but their previous work is still “in progress” from the
team’s perspective. A team of five with eight open PRs has eight items of hidden WIP that do not
appear on the sprint board as “in progress” but consume the same attention.
This hidden WIP interacts badly with explicit WIP. A developer who has one story “in progress” on
the board but three PRs waiting for review is actually juggling four streams of work. Each PR that
gets comments requires a context switch back to code they wrote days ago. The cognitive overhead is
real even if the board does not show it.
Impact on continuous delivery
Continuous delivery requires that every change move from commit to production quickly and
predictably. Review bottlenecks create an unpredictable queue between “code complete” and
“integrated.” The queue length varies based on reviewer availability, competing priorities, and
team habits. Some PRs merge in hours, others take days. This variability makes delivery timelines
unpredictable and prevents the steady flow of small changes that CD depends on.
The bottleneck also discourages the small, frequent changes that make CD safe. Developers learn
that every PR costs a multi-day wait, so they batch changes into larger PRs to reduce the number
of times they pay that cost. Larger PRs are riskier, harder to review, and more likely to cause
problems - exactly the opposite of what CD requires.
How to Fix It
Step 1: Measure review turnaround time (Week 1)
You cannot fix what you do not measure. Start tracking two numbers:
- Time to first review: elapsed time from PR opened to first reviewer comment or approval.
- PR age at merge: elapsed time from PR opened to PR merged.
Most teams discover their average is far worse than they assumed. Developers think reviews happen
in a few hours. The data shows days.
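If your repository is on GitHub, a short script can pull both numbers from the REST API. The sketch below is illustrative rather than a finished tool: the owner/repo values and token variable are placeholders, and it treats the first submitted review as the "first review" - adjust that definition if your team counts comments too.

```python
# Sketch: pull time-to-first-review and PR age at merge from the GitHub REST API.
# OWNER/REPO are placeholders; the token is read from GITHUB_TOKEN. "First review"
# here means the first submitted review - adapt if your team counts comments too.
import os
from datetime import datetime

import requests

OWNER, REPO = "your-org", "your-repo"  # placeholder values
API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}


def parse(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def review_metrics(limit: int = 50) -> None:
    prs = requests.get(
        f"{API}/repos/{OWNER}/{REPO}/pulls",
        headers=HEADERS,
        params={"state": "closed", "per_page": limit},
    ).json()

    for pr in prs:
        if not pr.get("merged_at"):
            continue  # skip PRs closed without merging
        opened, merged = parse(pr["created_at"]), parse(pr["merged_at"])

        reviews = requests.get(
            f"{API}/repos/{OWNER}/{REPO}/pulls/{pr['number']}/reviews",
            headers=HEADERS,
        ).json()
        first = min(
            (parse(r["submitted_at"]) for r in reviews if r.get("submitted_at")),
            default=None,
        )

        age_h = (merged - opened).total_seconds() / 3600
        if first is None:
            print(f"PR #{pr['number']}: no review recorded, merged after {age_h:.1f}h")
        else:
            wait_h = (first - opened).total_seconds() / 3600
            print(f"PR #{pr['number']}: first review after {wait_h:.1f}h, "
                  f"merged after {age_h:.1f}h")


if __name__ == "__main__":
    review_metrics()
```

Run it over the last month of merged PRs and compare the averages with what the team assumes - the gap is usually the conversation starter.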
Step 2: Set a team review SLA (Week 1)
Agree as a team on a review turnaround target. A reasonable starting point:
- Reviews within 2 hours during working hours.
- PR age at merge under 24 hours.
Write this down as a working agreement. Post it on the board. This is not a suggestion - it is a
team commitment.
Step 3: Make reviews a first-class activity (Week 2)
The core behavior change: reviewing code is not something you do when you have spare time. It is
the highest-priority activity after your current task reaches a natural stopping point.
Concrete practices:
- Check for open PRs before starting new work. When a developer finishes a task or hits a
natural pause, their first action is to check for pending reviews, not to pull a new story.
- Auto-assign reviewers. Do not wait for someone to volunteer. Configure your tools to
assign a reviewer automatically when a PR is opened (see the sketch after this list).
- Rotate reviewers. Do not let one or two people carry all the review load. Any team member
should be able to review any PR. This spreads knowledge and distributes the work.
- Keep PRs small. Target under 200 lines of changed code. Small PRs get reviewed faster and
more thoroughly. If a developer says their PR is “too large to split,” that is a work
decomposition problem.
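On the auto-assignment point: most platforms have this built in - GitHub, for example, can request reviews automatically through CODEOWNERS or its team review assignment settings. If you prefer to script the rotation yourself, a minimal sketch might look like the following, assuming a GitHub repository; the reviewer list, repo names, and rotation rule are placeholders.

```python
# Sketch: request a reviewer on a newly opened PR, rotating through the team.
# Assumes GitHub; OWNER/REPO, the REVIEWERS list, and the rotation rule
# (PR number modulo team size) are placeholders.
import os

import requests

OWNER, REPO = "your-org", "your-repo"          # placeholder values
REVIEWERS = ["alice", "bob", "carol", "dave"]  # hypothetical usernames
API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}


def assign_reviewer(pr_number: int, author: str) -> str:
    # Deterministic rotation: pick by PR number, never the PR's own author.
    candidates = [r for r in REVIEWERS if r != author]
    reviewer = candidates[pr_number % len(candidates)]
    resp = requests.post(
        f"{API}/repos/{OWNER}/{REPO}/pulls/{pr_number}/requested_reviewers",
        headers=HEADERS,
        json={"reviewers": [reviewer]},
    )
    resp.raise_for_status()
    return reviewer


if __name__ == "__main__":
    print(assign_reviewer(pr_number=42, author="alice"))  # example invocation
```

Trigger it from whatever fires when a PR opens - a webhook or a CI job. The point is that nobody has to remember to ask for a review.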
Step 4: Consider synchronous review (Week 3+)
The fastest review is one that happens in real time. If async review consistently exceeds the
team’s SLA, move toward synchronous alternatives:
| Method | How it works | Review wait time |
| --- | --- | --- |
| Pair programming | Two developers write the code together. Review is continuous. | Zero |
| Over-the-shoulder | Author walks reviewer through the change on a call. | Minutes |
| Rapid async | PR opened, reviewer notified, review within 2 hours. | Under 2 hours |
| Traditional async | PR opened, reviewer gets to it when they can. | Hours to days |
Pair programming eliminates the review bottleneck entirely. The code is reviewed as it is written.
There is no PR, no queue, and no wait. For teams that struggle with review latency, pairing is
often the most effective solution.
Step 5: Address the objections
| Objection | Response |
| --- | --- |
| “I can’t drop what I’m doing to review” | You are not dropping everything. You are checking for reviews at natural stopping points: after a commit, after a test passes, after a meeting. Reviews that take 10 minutes should not require “dropping” anything. |
| “Reviews take too long because the PRs are too big” | Then the PRs need to be smaller. A 50-line change takes 5-10 minutes to review. The review is not the bottleneck - the PR size is. |
| “Only senior developers can review this code” | That is a knowledge silo. Rotate reviewers so that everyone builds familiarity with every part of the codebase. Junior developers reviewing senior code is learning. Senior developers reviewing junior code is mentoring. Both are valuable. |
| “We need two reviewers for compliance” | Check whether your compliance framework actually requires two human reviewers, or whether it requires two sets of eyes on the code. Pair programming satisfies most separation-of-duties requirements while eliminating review latency. |
| “We tried faster reviews and quality dropped” | Fast does not mean careless. Automate style checks so reviewers focus on logic, correctness, and design. Small PRs get better reviews than large ones regardless of speed. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Time to first review | Should drop below 2 hours |
| PR age at merge | Should drop below 24 hours |
| Open PR count | Should stay low - ideally fewer than the number of team members |
| PR size (lines changed) | Should trend below 200 lines |
| Review rework cycles | Should stay under 2 rounds per PR |
| Development cycle time | Should decrease as review wait time drops |
1.1.2 - Work Items Too Large
Work items regularly take more than a week. Developers work on a single item for days without integrating.
Category: Team Workflow | Quality Impact: High
What This Looks Like
A developer picks up a work item on Monday. By Wednesday, they are still working on it. By Friday,
it is “almost done.” The following Monday, they are fixing edge cases. The item finally moves to
review mid-week - a 300-line pull request that the reviewer does not have time to look at
carefully.
Common variations:
- The week-long item. Work items routinely take five or more days. Developers work on a single
item for an entire sprint without integrating to trunk. The branch diverges further every day.
- The “it’s really just one thing” item. A ticket titled “Add user profile page” hides a
login form, avatar upload, email verification, notification preferences, and password reset.
It looks like one feature to the product owner. It is six features to the developer.
- The point-inflated item. The team estimates work at 8 or 13 points. Nobody questions
whether an 8-point item should be decomposed. High estimates are treated as a property of the
work rather than a signal that the work is too big.
- The “spike that became a feature.” A time-boxed investigation turns into an implementation.
The developer keeps going because they have momentum, and the result is a large, unreviewed
change that was never planned or decomposed.
- The horizontal slice. Work is split by technical layer: “build the database schema,”
“build the API,” “build the UI.” Each item takes days because it spans the entire layer.
Nothing is deployable until all three are done.
The telltale sign: look at the team’s cycle time distribution. If work items regularly take five
or more days from start to done, the items are too large.
Why This Is a Problem
Large work items are not just slow. They are a compounding force that makes every other part of
the delivery process worse.
They prevent daily integration
Trunk-based development requires integrating to trunk at least once per day. A work item that
takes a week to complete cannot be integrated daily unless it is decomposed into smaller pieces
that are each independently integrable. Most teams with large work items do not decompose them -
they work on a branch for the full duration and merge at the end.
This means a week of work is invisible to the rest of the team until it lands as a single large
merge. A week of assumptions goes untested against the real state of trunk. A week of potential
merge conflicts accumulates silently.
When work items are small enough to complete in one to two days, each item is a natural
integration point. The developer finishes the item, integrates to trunk, and the change is
tested, reviewed, and deployed before the next item begins.
They make estimation meaningless
Large work items hide unknowns. An item estimated at 8 points might take three days or three
weeks depending on what the developer discovers along the way. The estimate is a guess wrapped in
false precision.
This makes planning unreliable. The team commits to a set of large items, discovers mid-sprint
that one of them is twice as big as expected, and scrambles at the end. The retrospective
identifies “estimation accuracy” as the problem, but the real problem is that the items were too
big to estimate accurately in the first place.
Small work items are inherently more predictable. An item that takes one to two days has a narrow
range of uncertainty. Even if the estimate is off, it is off by hours, not weeks. Plans built
from small items are more reliable because the variance of each item is small.
They increase rework
A developer working on a large item makes dozens of decisions over several days: architectural
choices, naming conventions, error handling approaches, API contracts. These decisions are made in
isolation. Nobody sees them until the code review, which happens after all the work is done.
When the reviewer disagrees with a fundamental decision made on day one, the developer has built
five days of work on top of it. The rework cost is enormous. They either rewrite large portions
of the code or the team accepts a suboptimal decision because the cost of changing it is too high.
With small items, decisions surface quickly. A one-day item produces a small pull request that is
reviewed within hours. If the reviewer disagrees with an approach, the cost of changing it is a
few hours of work, not a week. Fundamental design problems are caught early, before layers of
code are built on top of them.
They hide risk until the end
A large work item carries risk that is invisible until late in its lifecycle. The developer might
discover on day four that the chosen approach does not work, that an API they depend on behaves
differently than documented, or that the database cannot handle the query pattern they assumed.
When this discovery happens on day four of a five-day item, the options are bad: rush a fix, cut
scope, or miss the sprint commitment. The team had no visibility into the risk because the work
was a single opaque block on the board.
Small items surface risk early. If the approach does not work, the team discovers it on day one
of a one-day item. The cost of changing direction is minimal. The risk is contained to a small
unit of work rather than spreading across an entire feature.
Impact on continuous delivery
Continuous delivery is built on small, frequent, low-risk changes flowing through the pipeline.
Large work items produce the opposite: infrequent, high-risk changes that batch up in branches
and land as large merges.
A team with five developers working on five large items has zero deployable changes for days at a
time. Then several large changes land at once, the pipeline is busy for hours, and conflicts
between the changes create unexpected failures. This is batch-and-queue delivery wearing agile
clothing.
The feedback loop is broken too. A small change deployed to production gives immediate signal:
does the change work? Does it affect performance? Do users behave as expected? A large change
deployed after a week gives noisy signal: something changed, but which of the fifty modifications
caused the issue?
How to Fix It
Step 1: Establish the 2-day rule (Week 1)
Agree as a team: no work item should take longer than two days from start to integrated on
trunk.
This is not a velocity target. It is a constraint on item size. If an item cannot be completed
in two days, it must be decomposed before it is pulled into the sprint.
Write this as a working agreement and enforce it during planning. When someone estimates an item
at more than two days, the response is “how do we split this?” - not “who can do it faster?”
Step 2: Learn vertical slicing (Week 2)
The most common decomposition mistake is horizontal slicing - splitting by technical layer instead
of by user-visible behavior. Train the team on vertical slicing:
Horizontal (avoid):
| Work item | Deployable? | Testable end-to-end? |
| --- | --- | --- |
| Build the database schema for orders | No | No |
| Build the API for orders | No | No |
| Build the UI for orders | Only after all three are done | Only after all three are done |
Vertical (prefer):
| Work item | Deployable? | Testable end-to-end? |
| --- | --- | --- |
| User can create a basic order (DB + API + UI) | Yes | Yes |
| User can add a discount to an order | Yes | Yes |
| User can view order history | Yes | Yes |
Each vertical slice cuts through all layers to deliver a thin piece of complete functionality.
Each is independently deployable and testable. Each gives feedback before the next slice begins.
Step 3: Use acceptance criteria as a splitting signal (Week 2+)
Count the acceptance criteria on each work item. If an item has more than three to five acceptance
criteria, it is probably too big. Each criterion or small group of criteria can become its own
item.
Write acceptance criteria in concrete Given-When-Then format. Each scenario is a natural
decomposition boundary:
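For example, a hypothetical “apply a discount code” item might carry scenarios like these:
- Given an order over $100, when the customer applies the code SAVE10, then the order total is reduced by 10%
- Given an order under $100, when the customer applies the code SAVE10, then the discount is rejected with an “order does not qualify” message
- Given an expired discount code, when the customer applies it, then the order total is unchanged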
Each scenario can be implemented, integrated, and deployed independently.
Step 4: Decompose during refinement, not during the sprint (Week 3+)
Work items should arrive at planning already decomposed. If the team is splitting items
mid-sprint, refinement is not doing its job.
During backlog refinement:
- Product owner presents the feature or outcome.
- Team discusses the scope and writes acceptance criteria.
- If the item has more than three to five criteria, split it immediately.
- Each resulting item is estimated. Any item over two days is split again.
- Items enter the sprint already small enough to flow.
Step 5: Address the objections
| Objection | Response |
| --- | --- |
| “Splitting creates too many items to manage” | Small items are easier to manage, not harder. They have clear scope, predictable timelines, and simple reviews. The overhead per item should be near zero. If it is not, simplify your process. |
| “Some things can’t be done in two days” | Almost anything can be decomposed further. Database migrations can be done in backward-compatible steps. UI changes can be hidden behind feature flags. The skill is finding the decomposition, not deciding whether one exists. |
| “We’ll lose the big picture if we split too much” | The epic or feature still exists as an organizing concept. Small items are not independent fragments - they are ordered steps toward a defined outcome. Use an epic to track the overall feature and individual items to track the increments. |
| “Product doesn’t want partial features” | Feature flags let you deploy incomplete features without exposing them to users. The code is integrated and tested continuously, but the user-facing feature is toggled on only when all slices are done. |
| “Our estimates are fine, items just take longer than expected” | That is the definition of items being too big. Small items have narrow estimation variance. If a one-day item takes two days, you are off by a day. If a five-day item takes ten, you have lost a sprint. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Item cycle time | Should be two days or less from start to trunk |
| Development cycle time | Should decrease as items get smaller |
| Items completed per week | Should increase even if total output stays the same |
| Integration frequency | Should increase as developers integrate completed items daily |
| Items that exceed the 2-day rule | Track violations and discuss in retrospectives |
| Work in progress | Should decrease as smaller items flow through faster |
1.1.3 - No Vertical Slicing
Work is organized by technical layer - “build the API,” “build the UI” - rather than by user-visible behavior. Nothing is deployable until all layers are done.
Category: Team Workflow | Quality Impact: Medium
What This Looks Like
The team breaks a feature into work items by architectural layer. One item for the database
schema. One for the API. One for the frontend. Maybe one for “integration testing” at the end.
Each item lives in a different lane or is assigned to a different specialist. Nothing reaches
production until the last layer is finished and all the pieces are stitched together.
Common variations:
- Layer-based assignment. “The backend team builds the API, the frontend team builds the UI.”
Each team delivers their layer independently. Integration is a separate phase that happens after
both teams are “done.”
- The database-first approach. Every feature starts with “build the schema.” Weeks of database
work happen before any API or UI exists. The schema is designed for the complete feature rather
than for the first thin slice.
- The API-then-UI pattern. The API is built and “tested” in isolation with Postman or curl.
The UI is built weeks later against the API. Mismatches between what the API provides and what
the UI needs are discovered at the end.
- The “integration sprint.” After the layers are built separately, the team dedicates a sprint
to wiring everything together. This sprint always takes longer than planned because the layers
were built on different assumptions.
- Technical stories on the board. The backlog contains items like “create database indexes,”
“add caching layer,” or “refactor service class.” None of these deliver user-visible value. They
are infrastructure work that has been separated from the feature it supports.
The telltale sign: ask “can we deploy this work item to production and have a user see something
different?” If the answer is no, the work is sliced horizontally.
Why This Is a Problem
Horizontal slicing feels natural to developers because it matches how they think about the
system’s architecture. But it optimizes for how the code is organized, not for how value is
delivered. The consequences compound across every dimension of delivery.
Nothing is deployable until everything is done
A horizontal slice delivers no user-visible value on its own. The database schema alone does
nothing. The API alone does nothing a user can see. The UI alone has no data to display. Value
only emerges when all layers are assembled - and that assembly happens at the end.
This means the team has zero deployable output for the entire duration of the feature build. A
feature that takes three sprints to build across layers produces three sprints of work in progress
and zero deliverables. The team is busy the entire time, but nothing reaches production.
With vertical slicing, every item is deployable. The first slice might be “user can create a
basic order” - thin, but it touches the database, API, and UI. It can be deployed to production
behind a feature flag on day two. Feedback starts immediately. The remaining slices build on a
working foundation rather than converging on an untested one.
Integration risk accumulates invisibly
When layers are built separately, each team or developer makes assumptions about how their layer
will connect to the others. The backend developer assumes the API contract looks a certain way.
The frontend developer assumes the response format matches their component design. The database
developer assumes the query patterns align with how the API will call them.
These assumptions are untested until integration. The longer the layers are built in isolation,
the more assumptions accumulate and the more likely they are to conflict. Integration becomes the
riskiest phase of the project - the phase where all the hidden mismatches surface at once.
With vertical slicing, integration happens with every item. The first slice forces the developer
to connect all the layers immediately. Assumptions are tested on day one, not month three.
Subsequent slices extend a working, integrated system rather than building isolated components
that have never talked to each other.
Feedback is delayed until it is expensive to act on
A horizontal approach delays user feedback until the full feature is assembled. If the team builds
the wrong thing - misunderstands a requirement, makes a poor UX decision, or solves the wrong
problem - they discover it after weeks of work across multiple layers.
At that point, the cost of changing direction is enormous. The database schema, API contracts, and
UI components all need to be reworked. The team has already invested heavily in an approach that
turns out to be wrong.
Vertical slicing delivers feedback with every increment. The first slice ships a thin version of
the feature that real users can see. If the approach is wrong, the team discovers it after a day
or two of work, not after a month. The cost of changing direction is the cost of one small item,
not the cost of an entire feature.
It creates specialist dependencies and handoff delays
Horizontal slicing naturally leads to specialist assignment: the database expert takes the
database item, the API expert takes the API item, the frontend expert takes the frontend item.
Each person works in isolation on their layer, and the work items have dependencies between them -
the API cannot be built until the schema exists, the UI cannot be built until the API exists.
These dependencies create sequential handoffs. The database work finishes, but the API developer
is busy with something else. The API work finishes, but the frontend developer is mid-sprint on
a different feature. Each handoff introduces wait time that has nothing to do with the complexity
of the work.
Vertical slicing eliminates these dependencies. A single developer (or pair) implements the full
slice across all layers. There are no handoffs between layers because one person owns the entire
thin slice from database to UI. This also spreads knowledge - developers who work across all
layers understand the full system, not just their specialty.
Impact on continuous delivery
Continuous delivery requires a continuous flow of small, independently deployable changes.
Horizontal slicing produces the opposite: a batch of interdependent layer changes that can only
be deployed together after a separate integration phase.
A team that slices horizontally cannot deploy continuously because there is nothing to deploy
until all layers converge. They cannot get production feedback because nothing user-visible exists
until the end. They cannot limit risk because the first real test of the integrated system happens
after all the work is done.
The pipeline itself becomes less useful. When changes are horizontal slices, the pipeline can only
verify that one layer works in isolation - it cannot run meaningful end-to-end tests until all
layers exist. The pipeline gives a false green signal (“the API tests pass”) that hides the real
question (“does the feature work?”).
How to Fix It
Step 1: Learn to recognize horizontal slices (Week 1)
Before changing how the team slices, build awareness. Review the current sprint board and backlog.
For each work item, ask:
- Can a user (or another system) observe the change after this item is deployed?
- Can I write an end-to-end test for this item alone?
- Does this item deliver value without waiting for other items to be completed?
If the answer to any of these is no, the item is likely a horizontal slice. Tag these items and
count them. Most teams discover that a majority of their backlog is horizontally sliced.
Step 2: Reslice one feature vertically (Week 2)
Pick one upcoming feature and practice reslicing it. Start with the current horizontal breakdown
and convert it:
Before (horizontal):
- Create the database tables for notifications
- Build the notification API endpoints
- Build the notification preferences UI
- Integration testing for notifications
After (vertical):
- User receives an email notification when their order ships (DB + API + email + minimal UI)
- User can view notification history on their profile page
- User can disable email notifications for order updates
- User can choose between email and SMS for shipping notifications
Each vertical slice is independently deployable and testable end-to-end. Each delivers something
a user can see. The team gets feedback after item 1 instead of after item 4.
Step 3: Use the deployability test in refinement (Week 3+)
Make the deployability test a standard part of backlog refinement. For every proposed work item,
ask: “If this were the only thing we shipped this sprint, would a user notice?”
If not, the item needs reslicing. This single question catches most horizontal slices before they
enter the sprint.
Complement this with concrete acceptance criteria in Given-When-Then format. Each scenario should
describe observable behavior, not technical implementation:
- Good: “Given a registered user, when they update their email, then a verification link is sent
to the new address”
- Bad: “Build the email verification API endpoint”
Step 4: Break the specialist habit (Week 4+)
Horizontal slicing and specialist assignment reinforce each other. As long as “the backend
developer does the backend work,” slicing by layer feels natural.
Break this cycle:
- Have developers work full-stack on vertical slices. A developer who implements the entire
slice - database, API, and UI - will naturally slice vertically because they own the full
delivery.
- Pair a specialist with a generalist. If a developer is uncomfortable with a particular
layer, pair them with someone who knows it. This builds cross-layer skills while delivering
vertical slices.
- Rotate who works on what. Do not let the same person always take the database items. When
anyone can work on any layer, the team stops organizing work by layer.
Step 5: Address the objections
| Objection | Response |
| --- | --- |
| “Our developers are specialists - they can’t work across layers” | That is a skill gap, not a constraint. Pairing a frontend developer with a backend developer on a vertical slice builds the missing skills while delivering the work. The short-term slowdown produces long-term flexibility. |
| “The database schema needs to be designed holistically” | Design the schema incrementally. Add the columns and tables needed for the first slice. Extend them for the second. This is how trunk-based database evolution works - backward-compatible, incremental changes. Designing the “complete” schema upfront leads to speculative design that changes anyway. |
| “Vertical slices create duplicate work across layers” | They create less total work because integration problems are caught immediately instead of accumulating. The “duplicate” concern usually means the team is building more infrastructure than the current slice requires. Build only what the current slice needs. |
| “Some work is genuinely infrastructure” | True infrastructure work (setting up a new database, provisioning a service) still needs to be connected to a user outcome. “Provision the notification service and send one test notification” is a vertical slice that includes the infrastructure. |
| “Our architecture makes vertical slicing hard” | That is a signal about the architecture. Tightly coupled layers that cannot be changed independently are a deployment risk. Vertical slicing exposes this coupling early, which is better than discovering it during a high-stakes integration phase. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Percentage of work items that are independently deployable | Should increase toward 100% |
| Time from feature start to first production deploy | Should decrease as the first vertical slice ships early |
| Development cycle time | Should decrease as items no longer wait for other layers |
| Integration issues discovered late | Should decrease as integration happens with every slice |
| Integration frequency | Should increase as deployable slices are completed and merged daily |
1.1.4 - Too Much Work in Progress
Every developer is on a different story. Eight items in progress, zero done. Nothing gets the focused attention needed to finish.
Category: Team Workflow | Quality Impact: High
What This Looks Like
Open the team’s board on any given day. Count the items in progress. Now count the team members.
If the first number is significantly higher than the second, the team has a WIP problem.
Common variations:
- Everyone on a different story. A team of five has eight or more stories in progress. Nobody
is working on the same thing. The board is a wide river of half-finished work.
- Sprint-start explosion. On the first day of the sprint, every developer pulls a story. By
mid-sprint, all stories are “in progress” and none are “done.” The last day is a scramble to
close anything.
- Individual WIP hoarding. A single developer has three stories assigned: one they’re actively
coding, one waiting for review, and one blocked on a question. They count all three as “in
progress” and start nothing new - but they also don’t help anyone else finish.
- Hidden WIP. The board shows five items in progress, but each developer is also investigating
a production bug, answering questions about a previous story, and prototyping something for next
sprint. Unofficial work doesn’t appear on the board but consumes the same attention.
- Expedite as default. Urgent requests arrive mid-sprint. Instead of replacing existing work,
they stack on top. WIP grows because nothing is removed when something is added.
The telltale sign: the team is busy all the time but finishes very little. Stories take longer and
longer to complete. The sprint ends with a pile of items at 80% done.
Why This Is a Problem
High WIP is not a sign of a productive team. It is a sign of a team that has optimized for
starting work instead of finishing it. The consequences compound over time.
It destroys focus and increases context switching
Every item in progress competes for a developer’s attention. A developer working on one story can
focus deeply. A developer juggling three stories - one active, one waiting for review, one they
need to answer questions about - is constantly switching context. Research consistently shows that
each additional concurrent task reduces productive time by 20-40%.
The switching cost is not just time. It is cognitive load. Developers lose their mental model of
the code when they switch away, and it takes 15-30 minutes to rebuild it when they switch back.
Multiply this across five context switches per day and the team is spending more time reloading
context than writing code.
In a low-WIP environment, developers finish one thing before starting the next. Deep focus is the
default. Context switching is the exception, not the rule.
It inflates cycle time
Little’s Law is not a suggestion. It is a mathematical relationship: cycle time equals work in
progress divided by throughput. If a team’s throughput is roughly constant (and over weeks, it is),
the only way to reduce cycle time is to reduce WIP.
A team of five with a throughput of ten stories per sprint and five stories in progress has an
average cycle time of half a sprint. The same team with fifteen stories in progress has an average
cycle time of 1.5 sprints. The work is not getting done faster because more of it was started. It
is getting done slower because all of it is competing for the same capacity.
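The relationship is easy to sanity-check with your own numbers. A minimal sketch, using the figures from this example:

```python
# Little's Law: average cycle time = average WIP / average throughput.
# The numbers below mirror the example in the text; substitute your own.
def average_cycle_time(wip: float, throughput_per_sprint: float) -> float:
    """Average cycle time in sprints for a given WIP and throughput."""
    return wip / throughput_per_sprint


print(average_cycle_time(wip=5, throughput_per_sprint=10))   # 0.5 sprints
print(average_cycle_time(wip=15, throughput_per_sprint=10))  # 1.5 sprints
```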
Long cycle times create their own problems. Feedback is delayed. Requirements go stale.
Integration conflicts accumulate. The longer a story sits in progress, the more likely it is to
need rework when it finally reaches review or testing.
It hides bottlenecks
When WIP is high, bottlenecks are invisible. If code reviews are slow, a developer just starts
another story while they wait. If the test environment is broken, they work on something else. The
constraint is never confronted because there is always more work to absorb the slack.
This is comfortable but destructive. The bottleneck does not go away because the team is working
around it. It quietly degrades the system. Reviews pile up. Test environments stay broken. The
team’s real throughput is constrained by the bottleneck, but nobody feels the pain because they
are always busy.
When WIP is limited, bottlenecks become immediately visible. A developer who cannot start new work
because the WIP limit is reached has to swarm on something blocked. “I’m idle because my PR has
been waiting for review for two hours” is a problem the team can solve. “I just started another
story while I wait” hides the same problem indefinitely.
It prevents swarming and collaboration
When every developer has their own work in progress, there is no incentive to help anyone else.
Reviewing a teammate’s pull request, pairing on a stuck story, or helping debug a failing test all
feel like distractions from “my work.” The result is that every item moves through the pipeline
alone, at the pace of a single developer.
Swarming - multiple team members working together to finish the highest-priority item - is
impossible when everyone has their own stories to protect. If you ask a developer to drop their
current story and help finish someone else’s, you are asking them to fall behind on their own work.
The incentive structure is broken.
In a low-WIP environment, finishing the team’s most important item is everyone’s job. When only
three items are in progress for a team of five, two people are available to pair, review, or
unblock. Collaboration is the natural state, not a special request.
Impact on continuous delivery
Continuous delivery requires a steady flow of small, finished changes moving through the pipeline.
High WIP produces the opposite: a large batch of unfinished changes sitting in various stages of
completion, blocking each other, accumulating merge conflicts, and stalling in review queues.
A team with fifteen items in progress does not deploy fifteen times as often as a team with one
item in progress. They deploy less frequently because nothing is fully done. Each “almost done”
story is a small batch that has not yet reached the pipeline. The batch keeps growing until
something forces a reckoning - usually the end of the sprint.
The feedback loop breaks too. When changes sit in progress for days, the developer who wrote the
code has moved on by the time the review comes back or the test fails. They have to reload context
to address feedback, which takes more time, which delays the next change, which increases WIP
further. The cycle reinforces itself.
How to Fix It
Step 1: Make WIP visible (Week 1)
Before setting any limits, make the current state impossible to ignore.
- Count every item currently in progress for the team. Include stories, bugs, spikes, and any
unofficial work that is consuming attention.
- Write this number on the board. Update it daily.
- Most teams are shocked. A team of five typically discovers 12-20 items in progress once hidden
work is included.
Do not try to fix anything yet. The goal is awareness.
Step 2: Set an initial WIP limit (Week 2)
Use the N+2 formula as a starting point, where N is the number of team members actively
working on delivery.
| Team size | Starting WIP limit | Why |
| --- | --- | --- |
| 3 developers | 5 items | One per person plus a buffer for blocked items |
| 5 developers | 7 items | Same ratio |
| 8 developers | 10 items | Buffer shrinks proportionally |
Add the limit to the board as a column header or policy: “In Progress (limit: 7).” Agree as a
team that when the limit is reached, nobody starts new work.
Step 3: Enforce the limit with swarming (Week 3+)
When the WIP limit is reached and a developer finishes something, they have two options:
- Pull the next highest-priority item if the WIP count is below the limit.
- Swarm on an existing item if the WIP count is at the limit.
Swarming means pairing on a stuck story, reviewing a pull request, writing a test someone needs
help with, or resolving a blocker. The key behavior change: “I have nothing to do” is never the
right response. “What can I help finish?” is.
Step 4: Lower the limit over time (Monthly)
The initial limit is a starting point. Each month, consider reducing it by one.
| Limit | What it exposes |
| --- | --- |
| N+2 | Gross overcommitment. Most teams find this is already a significant reduction. |
| N+1 | Slow reviews, environment contention, unclear requirements. Team starts swarming. |
| N | Every person on one thing. Blocked items get immediate attention. |
| Below N | Team is pairing by default. Cycle time drops sharply. |
Each reduction will feel uncomfortable. That discomfort is the point - it exposes constraints in
the workflow that were hidden by excess WIP.
Step 5: Address the objections
Expect resistance and prepare for it:
| Objection | Response |
| --- | --- |
| “I’ll be idle if I can’t start new work” | Idle hands are not the problem. Idle work is. Help finish something instead of starting something new. |
| “Management will see people not typing and think we’re wasting time” | Track cycle time and throughput. When both improve, the data speaks for itself. |
| “We have too many priorities to limit WIP” | Having many priorities is exactly why you need a WIP limit. Without one, nothing gets the focus needed to finish. Everything is “in progress,” nothing is done. |
| “What about urgent production issues?” | Keep one expedite slot. If a production issue arrives, it takes the slot. If the slot is full, the new issue replaces the current one. Expedite is not a way to bypass the limit - it is part of the limit. |
| “Our stories are too big to pair on” | That is a separate problem. See Work Decomposition. Stories should be small enough that anyone can pick them up. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Work in progress | Should stay at or below the team’s limit |
| Development cycle time | Should decrease as WIP drops |
| Stories completed per week | Should stabilize or increase despite starting fewer items |
| Time items spend blocked | Should decrease as the team swarms on blockers |
| Sprint-end scramble | Should disappear as work finishes continuously through the sprint |
1.1.5 - Push-Based Work Assignment
Work is assigned to individuals by a manager or lead instead of team members pulling the next highest-priority item.
Category: Team Workflow | Quality Impact: High
What This Looks Like
A manager, tech lead, or project manager decides who works on what. Assignments happen during
sprint planning, in one-on-ones, or through tickets pre-assigned before the sprint starts. Each
team member has “their” stories for the sprint. The assignment is rarely questioned.
Common variations:
- Assignment by specialty. “You’re the database person, so you take the database stories.” Work
is routed by perceived expertise rather than team priority.
- Assignment by availability. A manager looks at who is “free” and assigns the next item from
the backlog, regardless of what the team needs finished.
- Assignment by seniority. Senior developers get the interesting or high-priority work. Junior
developers get what’s left.
- Pre-loaded sprints. Every team member enters the sprint with their work already assigned. The
sprint board is fully allocated on day one.
The telltale sign: if you ask a developer “what should you work on next?” and the answer is “I
don’t know, I need to ask my manager,” work is being pushed.
Why This Is a Problem
Push-based assignment is one of the most quietly destructive practices a team can have. It
undermines nearly every CD practice by breaking the connection between the team and the flow of
work. Each of its effects compounds the others.
It reduces quality
Push assignment makes code review feel like a distraction from “my stories.” When every developer
has their own assigned work, reviewing someone else’s pull request is time spent not making progress
on your own assignment. Reviews sit for hours or days because the reviewer is busy with their own
work. The same dynamic discourages pairing: spending an hour helping a colleague means falling
behind on your own assignments, so developers don’t offer and don’t ask.
This means fewer eyes on every change. Defects that a second person would catch in minutes survive
into production. Knowledge stays siloed because there is no reason to look at code outside your
assignment. The team’s collective understanding of the codebase narrows over time.
In a pull system, reviewing code and unblocking teammates are the highest-priority activities
because finishing the team’s work is everyone’s work. Reviews happen quickly because they are not
competing with “my stories” - they are the work. Pairing happens naturally because anyone might
pick up any story, and asking for help is how the team moves its highest-priority item forward.
It increases rework
Push assignment routes work by specialty: “You’re the database person, so you take the database
stories.” This creates knowledge silos where only one person understands a part of the system.
When the same person always works on the same area, mistakes go unreviewed by anyone with a fresh
perspective. Assumptions go unchallenged because the reviewer lacks context to question them.
Misinterpretation of requirements also increases. The assigned developer may not have context on why
a story is high priority or what business outcome it serves - they received it as an assignment, not
as a problem to solve. When the result doesn’t match what was needed, the story comes back for
rework.
In a pull system, anyone might pick up any story, so knowledge spreads across the team. Fresh eyes
catch assumptions that a domain expert would miss. Developers who pull a story engage with its
priority and purpose because they chose it from the top of the backlog. Rework drops because more
perspectives are involved earlier.
It makes delivery timelines unpredictable
Push assignment optimizes for utilization - keeping everyone busy - not for flow - getting things
done. Every developer has their own assigned work, so team WIP is the sum of all individual
assignments. There is no mechanism to say “we have too much in progress, let’s finish something
first.” WIP limits become meaningless when the person assigning work doesn’t see the full picture.
Bottlenecks are invisible because the manager assigns around them instead of surfacing them. If one
area of the system is a constraint, the assigner may not notice because they are looking at people,
not flow. In a pull system, the bottleneck becomes obvious: work piles up in one column and nobody
pulls it because the downstream step is full.
Workloads are uneven because managers cannot perfectly predict how long work will take. Some people
finish early and sit idle or start low-priority work, while others are overloaded. Feedback loops
are slow because the order of work is decided at sprint planning; if priorities change mid-sprint,
the manager must reassign. Throughput becomes erratic - some sprints deliver a lot, others very
little, with no clear pattern.
In a pull system, workloads self-balance: whoever finishes first pulls the next item. Bottlenecks
are visible. WIP limits actually work because the team collectively decides what to start. The team
automatically adapts to priority changes because the next person who finishes simply pulls whatever
is now most important.
It removes team ownership
Pull systems create shared ownership of the backlog. The team collectively cares about the priority
order because they are collectively responsible for finishing work. Push systems create individual
ownership: “that’s not my story.” When a developer finishes their assigned work, they wait for more
assignments instead of looking at what the team needs.
This extends beyond task selection. In a push system, developers stop thinking about the team’s
goals and start thinking about their own assignments. Swarming - multiple people collaborating to
finish the highest-priority item - is impossible when everyone “has their own stuff.” If a story is
stuck, the assigned developer struggles alone while teammates work on their own assignments.
The unavailability problem makes this worse. When each person works in isolation on “their” stories,
the rest of the team has no context on what that person is doing, how the work is structured, or
what decisions have been made. If the assigned person is out sick, on vacation, or leaves the
company, nobody can pick up where they left off. The work either stalls until that person returns or
another developer starts over - rereading requirements, reverse-engineering half-finished code, and
rediscovering decisions that were never shared. In a pull system, the team maintains context on
in-progress work because anyone might have pulled it, standups focus on the work rather than
individual status, and pairing spreads knowledge continuously. When someone is unavailable, the
next person simply picks up the item with enough shared context to continue.
Impact on continuous delivery
Continuous delivery depends on a steady, predictable flow of small changes through the pipeline.
Push-based assignment produces the opposite: batch-based assignment at sprint planning, uneven
bursts of activity as different developers finish at different times, blocked work sitting idle
because the assigned person is busy with something else, and no team-level mechanism for optimizing
throughput. You cannot build a continuous flow of work when the assignment model is batch-based and
individually scoped.
How to Fix It
Step 1: Order the backlog by priority (Week 1)
Before switching to a pull model, the backlog must have a clear priority order. Without it,
developers will not know what to pull next.
- Work with the product owner to stack-rank the backlog. Every item has a unique position - no
tied priorities.
- Make the priority visible. The top of the board or backlog is the most important item. There
is no ambiguity.
- Agree as a team: when you need work, you pull from the top.
Step 2: Stop pre-assigning work in sprint planning (Week 2)
Change the sprint planning conversation. Instead of “who takes this story,” the team:
- Pulls items from the top of the prioritized backlog into the sprint.
- Discusses each item enough for anyone on the team to start it.
- Leaves all items unassigned.
The sprint begins with a list of prioritized work and no assignments. This will feel uncomfortable
for the first sprint.
Step 3: Pull work daily (Week 2+)
At the daily standup (or anytime during the day), a developer who needs work:
- Looks at the sprint board.
- Checks if any in-progress item needs help (swarm first, pull second).
- If nothing needs help and the WIP limit allows, pulls the top unassigned item and assigns
themselves.
The developer picks up the highest-priority available item, not the item that matches their
specialty. This is intentional - it spreads knowledge, reduces bus factor, and keeps the team
focused on priority rather than comfort.
Step 4: Address the discomfort (Weeks 3-4)
Expect these objections and plan for them:
| Objection | Response |
| --- | --- |
| “But only Sarah knows the payment system” | That is a knowledge silo and a risk. Pairing Sarah with someone else on payment stories fixes the silo while delivering the work. |
| “I assigned work because nobody was pulling it” | If nobody pulls high-priority work, that is a signal: either the team doesn’t understand the priority, the item is poorly defined, or there is a skill gap. Assignment hides the signal instead of addressing it. |
| “Some developers are faster - I need to assign strategically” | Pull systems self-balance. Faster developers pull more items. Slower developers finish fewer but are never overloaded. The team throughput optimizes naturally. |
| “Management expects me to know who’s working on what” | The board shows who is working on what in real time. Pull systems provide more visibility than pre-assignment because assignments are always current, not a stale plan from sprint planning. |
Step 5: Combine with WIP limits (Week 4+)
Pull-based work and WIP limits reinforce each other:
- WIP limits prevent the team from pulling too much work at once.
- Pull-based assignment ensures that when someone finishes, they pull the next priority - not
whatever the manager thinks of next.
- Together, they create a system where work flows continuously from backlog to done.
See Limiting WIP for how to set and enforce WIP limits.
What managers do instead
Moving to a pull model does not eliminate the need for leadership. It changes the focus:
| Push model (before) | Pull model (after) |
| --- | --- |
| Decide who works on what | Ensure the backlog is prioritized and refined |
| Balance workloads manually | Coach the team on swarming and collaboration |
| Track individual assignments | Track flow metrics (cycle time, WIP, throughput) |
| Reassign work when priorities change | Update backlog priority and let the team adapt |
| Manage individual utilization | Remove systemic blockers the team cannot resolve |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Percentage of stories pre-assigned at sprint start | Should drop to near zero |
| Work in progress | Should decrease as team focuses on finishing |
| Development cycle time | Should decrease as swarming increases |
| Stories completed per sprint | Should stabilize or increase despite less “busyness” |
| Rework rate | Stories returned for rework or reopened after completion - should decrease |
| Knowledge distribution | Track who works on which parts of the system - should broaden over time |
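The last metric, knowledge distribution, can be approximated from version-control history by counting distinct commit authors per area of the codebase. A minimal sketch, assuming a git repository; the directory names and the 90-day window are placeholders:

```python
# Sketch: approximate knowledge distribution by counting distinct commit authors
# per area of the codebase over the last 90 days. The directory list and time
# window are placeholders - adapt them to your repository layout.
import subprocess


def authors_by_area(paths, since="90 days ago"):
    result = {}
    for path in paths:
        log = subprocess.run(
            ["git", "log", f"--since={since}", "--format=%an", "--", path],
            capture_output=True, text=True, check=True,
        ).stdout
        result[path] = {name for name in log.splitlines() if name.strip()}
    return result


if __name__ == "__main__":
    for area, authors in authors_by_area(["src/api", "src/ui", "src/db"]).items():
        print(f"{area}: {len(authors)} distinct authors in the last 90 days")
```

If one area consistently shows a single author, that is the silo to break first.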
1.2 - Branching and Integration
Anti-patterns in how teams branch, merge, and integrate code that prevent continuous integration and delivery.
These anti-patterns affect how code flows from a developer’s machine to the shared trunk. They
create painful merges, delayed integration, and broken builds that prevent the steady stream of
small, verified changes that continuous delivery requires.
1.2.1 - Long-Lived Feature Branches
Branches that live for weeks or months, turning merging into a project in itself. The longer the branch, the bigger the risk.
Category: Branching & Integration | Quality Impact: Critical
What This Looks Like
A developer creates a branch to build a feature. The feature is bigger than expected. Days pass,
then weeks. Other developers are doing the same thing on their own branches. Trunk moves forward
while each branch diverges further from it. Nobody integrates until the feature is “done” - and
by then, the branch is hundreds or thousands of lines different from where it started.
When the merge finally happens, it is an event. The developer sets aside half a day - sometimes
more - to resolve conflicts, re-test, and fix the subtle breakages that come from combining weeks
of divergent work. Other developers delay their merges to avoid the chaos. The team’s Slack channel
lights up with “don’t merge right now, I’m resolving conflicts.” Every merge creates a window where
trunk is unstable.
Common variations:
- The “feature branch” that is really a project. A branch named `feature/new-checkout` that
  lasts three months. Multiple developers commit to it. It has its own bug fixes and its own
  merge conflicts. It is a parallel fork of the product.
- The “I’ll merge when it’s ready” branch. The developer views the branch as a private workspace.
Merging to trunk is the last step, not a daily practice. The branch falls further behind each day
but the developer does not notice until merge day.
- The per-sprint branch. Each sprint gets a branch. All sprint work goes there. The branch is
merged at sprint end and a new one is created. Integration happens every two weeks instead of
every day.
- The release isolation branch. A branch is created weeks before a release to “stabilize” it.
Bug fixes must be applied to both the release branch and trunk. Developers maintain two streams
of work simultaneously.
- The “too risky to merge” branch. The branch has diverged so far that nobody wants to attempt
the merge. It sits for weeks while the team debates how to proceed. Sometimes it is abandoned
entirely and the work is restarted.
The telltale sign: if merging a branch requires scheduling a block of time, notifying the team, or
hoping nothing goes wrong - branches are living too long.
Why This Is a Problem
Long-lived feature branches appear safe. Each developer works in isolation, free from interference.
But that isolation is precisely the problem. It delays integration, hides conflicts, and creates
compounding risk that makes every aspect of delivery harder.
It reduces quality
When a branch lives for weeks, code review becomes a formidable task. The reviewer faces hundreds
of changed lines across dozens of files. Meaningful review is nearly impossible at that scale -
studies consistently show that review effectiveness drops sharply after 200-400 lines of change.
Reviewers skim, approve, and hope for the best. Subtle bugs, design problems, and missed edge
cases survive because nobody can hold the full changeset in their head.
The isolation also means developers make decisions in a vacuum. Two developers on separate branches
may solve the same problem differently, introduce duplicate abstractions, or make contradictory
assumptions about shared code. These conflicts are invisible until merge time, when they surface as
bugs rather than design discussions.
With short-lived branches or trunk-based development, changes are small enough for genuine review.
A 50-line change gets careful attention. Design disagreements surface within hours, not weeks. The
team maintains a shared understanding of how the codebase is evolving because they see every change
as it happens.
It increases rework
Long-lived branches guarantee merge conflicts. Two developers editing the same file on different
branches will not discover the collision until one of them merges. The second developer must then
reconcile their changes against an unfamiliar modification, often without understanding the intent
behind it. This manual reconciliation is rework in its purest form - effort spent making code work
together that would have been unnecessary if the developers had integrated daily.
The rework compounds. A developer who rebases a three-week branch against trunk may introduce
bugs during conflict resolution. Those bugs require debugging. The debugging reveals an assumption
that was valid three weeks ago but is no longer true because trunk has changed. Now the developer
must rethink and partially rewrite their approach. What should have been a day of work becomes a
week.
When developers integrate daily, conflicts are small - typically a few lines. They are resolved in
minutes with full context because both changes are fresh. The cost of integration stays constant
rather than growing exponentially with branch age.
It makes delivery timelines unpredictable
A two-day feature on a long-lived branch takes two days to build and an unknown number of days
to merge. The merge might take an hour. It might take two days. It might surface a design conflict
that requires reworking the feature. Nobody knows until they try. This makes it impossible to
predict when work will actually be done.
The queuing effect makes it worse. When several branches need to merge, they form a queue. The
first merge changes trunk, which means the second branch needs to rebase against the new trunk
before merging. If the second merge is large, it changes trunk again, and the third branch must
rebase. Each merge invalidates the work done to prepare the next one. Teams that “schedule” their
merges are admitting that integration is so costly it needs coordination.
Project managers learn they cannot trust estimates. “The feature is code-complete” does not mean
it is done - it means the merge has not started yet. Stakeholders lose confidence in the team’s
ability to deliver on time because “done” and “deployed” are separated by an unpredictable gap.
With continuous integration, there is no merge queue. Each developer integrates small changes
throughout the day. The time from “code-complete” to “integrated and tested” is minutes, not days.
Delivery dates become predictable because the integration cost is near zero.
It hides risk until the worst possible moment
Long-lived branches create an illusion of progress. The team has five features “in development,”
each on its own branch. The features appear to be independent and on track. But the risk is
hidden: none of these features have been proven to work together. The branches may contain
conflicting changes, incompatible assumptions, or integration bugs that only surface when combined.
All of that hidden risk materializes at merge time - the moment closest to the planned release
date, when the team has the least time to deal with it. A merge conflict discovered three weeks
before release is an inconvenience. A merge conflict discovered the day before release is a crisis.
Long-lived branches systematically push risk discovery to the latest possible point.
Continuous integration surfaces risk immediately. If two changes conflict, the team discovers it
within hours, while both changes are small and the authors still have full context. Risk is
distributed evenly across the development cycle instead of concentrated at the end.
Impact on continuous delivery
Continuous delivery requires that trunk is always in a deployable state and that any commit can be
released at any time. Long-lived feature branches make both impossible. Trunk cannot be deployable
if large, poorly validated merges land periodically and destabilize it. You cannot release any commit
if the latest commit is a 2,000-line merge that has not been fully tested.
Long-lived branches also prevent continuous integration - the practice of integrating every
developer’s work into trunk at least once per day. Without continuous integration, there is no
continuous delivery. The pipeline cannot provide fast feedback on changes that exist only on
private branches. The team cannot practice deploying small changes because there are no small
changes - only large merges separated by days or weeks of silence.
Every other CD practice - automated testing, pipeline automation, small batches, fast feedback -
is undermined when the branching model prevents frequent integration.
How to Fix It
Step 1: Measure your current branch lifetimes (Week 1)
Before changing anything, understand the baseline. For every open branch:
- Record when it was created and when (or if) it was last merged.
- Calculate the age in days.
- Note the number of changed files and lines.
Most teams are shocked by their own numbers. A branch they think of as “a few days old” is often
two or three weeks old. Making the data visible creates urgency.
Set a target: no branch older than one day. This will feel aggressive. That is the point.
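If the repository is in git, a small script can produce the baseline report. The sketch below is a starting point under stated assumptions, not a definitive tool: it treats the commit where each branch diverged from trunk as its creation date, and it assumes the trunk is `origin/main` and that `git fetch` has already run.

```python
# Minimal sketch: estimate each remote branch's age as days since it diverged from trunk.
import subprocess
import time

TRUNK = "origin/main"  # assumption - adjust to your trunk branch name

def git(*args: str) -> str:
    return subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout.strip()

def age_in_days(branch: str) -> int:
    fork_point = git("merge-base", TRUNK, branch)                    # commit where the branch diverged
    forked_at = int(git("show", "-s", "--format=%ct", fork_point))   # committer date as a unix timestamp
    return int((time.time() - forked_at) // 86400)

branches = [
    ref
    for ref in git("for-each-ref", "--format=%(refname:short)", "refs/remotes/origin").splitlines()
    if ref not in (TRUNK, "origin/HEAD")
]
ages = {branch: age_in_days(branch) for branch in branches}
for branch, days in sorted(ages.items(), key=lambda item: item[1], reverse=True):
    print(f"{days:>4} days  {branch}")
```

Run it in a scheduled job or paste the output into the team channel - the point is to make the numbers visible, not to build tooling.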
Step 2: Set a branch lifetime limit and make it visible (Week 2)
Agree as a team on a maximum branch lifetime. Start with two days if one day feels too aggressive.
The important thing is to pick a number and enforce it.
Make the limit visible:
- Add a dashboard or report that shows branch age for every open branch.
- Flag any branch that exceeds the limit in the daily standup.
- If your CI tool supports it, add a check that warns when a branch exceeds 24 hours.
The limit creates a forcing function. Developers must either integrate quickly or break their work
into smaller pieces. Both outcomes are desirable.
Step 3: Break large features into small, integrable changes (Weeks 2-3)
The most common objection is “my feature is too big to merge in a day.” This is true when the
feature is designed as a monolithic unit. The fix is decomposition:
- Branch by abstraction. Introduce a new code path alongside the old one. Merge the new code
path in small increments. Switch over when ready.
- Feature flags. Hide incomplete work behind a toggle so it can be merged to trunk without
being visible to users.
- Keystone interface pattern. Build all the back-end work first, merge it incrementally, and
add the UI entry point last. The feature is invisible until the keystone is placed.
- Vertical slices. Deliver the feature as a series of thin, user-visible increments instead of
building all layers at once.
Each technique lets developers merge daily without exposing incomplete functionality. The feature
grows incrementally on trunk rather than in isolation on a branch.
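As one illustration, hiding incomplete work behind a flag can be as simple as a conditional around the new code path. This is a minimal sketch, not a recommendation of any particular flag tool, and every name in it is invented for the example.

```python
import os

def checkout_v1(cart: list[float]) -> float:
    # Existing behavior - still the default for every user.
    return sum(cart)

def checkout_v2(cart: list[float]) -> float:
    # New path under construction. It merges to trunk in small increments
    # but stays invisible until the flag is turned on.
    return round(sum(cart), 2)

def new_checkout_enabled() -> bool:
    # Read from an environment variable here; a real team might use a
    # configuration file or a flag service instead.
    return os.environ.get("NEW_CHECKOUT_ENABLED", "false").lower() == "true"

def checkout(cart: list[float]) -> float:
    return checkout_v2(cart) if new_checkout_enabled() else checkout_v1(cart)
```

Because the flag defaults to off, partially built code can land on trunk every day without changing what users see.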
Step 4: Adopt short-lived branches with daily integration (Weeks 3-4)
Change the team’s workflow:
- Create a branch from trunk.
- Make a small, focused change.
- Get a quick review (the change is small, so review takes minutes).
- Merge to trunk. Delete the branch.
- Repeat.
Each branch lives for hours, not days. If a branch cannot be merged by end of day, it is too
large. The developer should either merge what they have (using one of the decomposition techniques
above) or discard the branch and start smaller tomorrow.
Pair this with the team’s code review practice. Small changes enable fast reviews, and fast reviews
enable short-lived branches. The two practices reinforce each other.
Step 5: Address the objections (Weeks 3-4)
| Objection | Response |
| --- | --- |
| “My feature takes three weeks - I can’t merge in a day” | The feature takes three weeks. The branch does not have to. Use branch by abstraction, feature flags, or vertical slicing to merge daily while the feature grows incrementally on trunk. |
| “Merging incomplete code to trunk is dangerous” | Incomplete code behind a feature flag or without a UI entry point is not dangerous - it is invisible. The danger is a three-week branch that lands as a single untested merge. |
| “I need my branch to keep my work separate from other changes” | That separation is the problem. You want to discover conflicts early, when they are small and cheap to fix. A branch that hides conflicts for three weeks is not protecting you - it is accumulating risk. |
| “We tried short-lived branches and it was chaos” | Short-lived branches require supporting practices: feature flags, good decomposition, fast CI, and a culture of small changes. Without those supports, it will feel chaotic. The fix is to build the supports, not to retreat to long-lived branches. |
| “Code review takes too long for daily merges” | Small changes take minutes to review, not hours. If reviews are slow, that is a review process problem, not a branching problem. See PR Review Bottlenecks. |
Step 6: Continuously tighten the limit (Week 5+)
Once the team is comfortable with two-day branches, reduce the limit to one day. Then push toward
integrating multiple times per day. Each reduction surfaces new problems - features that are hard
to decompose, tests that are slow, reviews that are bottlenecked - and each problem is worth
solving because it blocks the flow of work.
The goal is continuous integration: every developer integrates to trunk at least once per day.
At that point, “branches” are just short-lived workspaces that exist for hours, and merging is
a non-event.
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Average branch lifetime | Should decrease to under one day |
| Maximum branch lifetime | No branch should exceed two days |
| Integration frequency | Should increase toward at least daily per developer |
| Merge conflict frequency | Should decrease as branches get shorter |
| Merge duration | Should decrease from hours to minutes |
| Development cycle time | Should decrease as integration overhead drops |
| Lines changed per merge | Should decrease as changes get smaller |
Related Content
1.2.2 - No Continuous Integration
The build has been red for weeks and nobody cares. “CI” means a build server exists, not that anyone actually integrates continuously.
Category: Branching & Integration | Quality Impact: Critical
What This Looks Like
The team has a build server. It runs after every push. There is a dashboard somewhere that shows
build status. But the build has been red for three weeks and nobody has mentioned it. Developers
push code, glance at the result if they remember, and move on. When someone finally investigates,
the failure is in a test that broke weeks ago and nobody can remember which commit caused it.
The word “continuous” has lost its meaning. Developers do not integrate their work into trunk
daily - they work on branches for days or weeks and merge when the feature feels done. The build
server runs, but nobody treats a red build as something that must be fixed immediately. There is no
shared agreement that trunk should always be green. “CI” is a tool in the infrastructure, not a
practice the team follows.
Common variations:
- The build server with no standards. A CI server runs on every push, but there are no rules
about what happens when it fails. Some developers fix their failures. Others do not. The build
flickers between green and red all day, and nobody trusts the signal.
- The nightly build. The build runs once per day, overnight. Developers find out the next
morning whether yesterday’s work broke something. By then they have moved on to new work and
lost context on what they changed.
- The “CI” that is just compilation. The build server compiles the code and nothing else. No
tests run. No static analysis. The build is green as long as the code compiles, which tells the
team almost nothing about whether the software works.
- The manually triggered build. The build server exists, but it does not run on push. After
pushing code, the developer must log into the CI server and manually start the build and tests.
When developers are busy or forget, their changes sit untested. When multiple pushes happen
between triggers, a failure could belong to any of them. The feedback loop depends entirely on
developer discipline rather than automation.
- The branch-only build. CI runs on feature branches but not on trunk. Each branch builds in
isolation, but nobody knows whether the branches work together until merge day. Trunk is not
continuously validated.
- The ignored dashboard. The CI dashboard exists but is not displayed anywhere the team can
see it. Nobody checks it unless they are personally waiting for a result. Failures accumulate
silently.
The telltale sign: if you can ask “how long has the build been red?” and nobody knows the answer,
continuous integration is not happening.
Why This Is a Problem
Continuous integration is not a tool - it is a practice. The practice requires that every developer
integrates to a shared trunk at least once per day and that the team treats a broken build as the
highest-priority problem. Without the practice, the build server is just infrastructure generating
notifications that nobody reads.
It reduces quality
When the build is allowed to stay red, the team loses its only automated signal that something is
wrong. A passing build is supposed to mean “the software works as tested.” A failing build is
supposed to mean “stop and fix this before doing anything else.” When failures are ignored, that
signal becomes meaningless. Developers learn that a red build is background noise, not an alarm.
Once the build signal is untrusted, defects accumulate. A developer introduces a bug on Monday. The
build fails, but it was already red from an unrelated failure, so nobody notices. Another developer
introduces a different bug on Tuesday. By Friday, trunk has multiple interacting defects and nobody
knows when they were introduced or by whom. Debugging becomes archaeology.
When the team practices continuous integration, a red build is rare and immediately actionable. The
developer who broke it knows exactly which change caused the failure because they committed minutes
ago. The fix is fast because the context is fresh. Defects are caught individually, not in tangled
clusters.
It increases rework
Without continuous integration, developers work in isolation for days or weeks. Each developer
assumes their code works because it passes on their machine or their branch. But they are building
on assumptions about shared code that may already be outdated. When they finally integrate, they
discover that someone else changed an API they depend on, renamed a class they import, or modified
behavior they rely on.
The rework cascade is predictable. Developer A changes a shared interface on Monday. Developer B
builds three days of work on the old interface. On Thursday, developer B tries to integrate and
discovers the conflict. Now they must rewrite three days of code to match the new interface. If
they had integrated on Monday, the conflict would have been a five-minute fix.
Teams that integrate continuously discover conflicts within hours, not days. The rework is measured
in minutes because the conflicting changes are small and the developers still have full context on
both sides. The total cost of integration stays low and constant instead of spiking unpredictably.
It makes delivery timelines unpredictable
A team without continuous integration cannot answer the question “is the software releasable right
now?” Trunk may or may not compile. Tests may or may not pass. The last successful build may have
been a week ago. Between then and now, dozens of changes have landed without anyone verifying that
they work together.
This creates a stabilization period before every release. The team stops feature work, fixes the
build, runs the test suite, and triages failures. This stabilization takes an unpredictable amount
of time - sometimes a day, sometimes a week - because nobody knows how many problems have
accumulated since the last known-good state.
With continuous integration, trunk is always in a known state. If the build is green, the team can
release. If the build is red, the team knows exactly which commit broke it and how long ago. There
is no stabilization period because the code is continuously stabilized. Release readiness is a
fact that can be checked at any moment, not a state that must be achieved through a dedicated
effort.
It masks the true cost of integration problems
When the build is permanently broken or rarely checked, the team cannot see the patterns that would
tell them where their process is failing. Is the build slow? Nobody notices because nobody waits
for it. Are certain tests flaky? Nobody notices because failures are expected. Do certain parts of
the codebase cause more breakage than others? Nobody notices because nobody correlates failures to
changes.
These hidden problems compound. The build gets slower because nobody is motivated to speed it up.
Flaky tests multiply because nobody quarantines them. Brittle areas of the codebase stay brittle
because the feedback that would highlight them is lost in the noise.
When the team practices CI and treats a red build as an emergency, every friction point becomes
visible. A slow build annoys the whole team daily, creating pressure to optimize it. A flaky test
blocks everyone, creating pressure to fix or remove it. The practice surfaces the problems. Without
the practice, the problems are invisible and grow unchecked.
Impact on continuous delivery
Continuous integration is the foundation that every other CD practice is built on. Without it, the
pipeline cannot give fast, reliable feedback on every change. Automated testing is pointless if
nobody acts on the results. Deployment automation is pointless if the artifact being deployed has
not been validated. Small batches are pointless if the batches are never verified to work together.
A team that does not practice CI cannot practice CD. The two are not independent capabilities that
can be adopted in any order. CI is the prerequisite. Every hour that the build stays red is an
hour during which the team has no automated confidence that the software works. Continuous delivery
requires that confidence to exist at all times.
How to Fix It
Step 1: Fix the build and agree it stays green (Week 1)
Before anything else, get trunk to green. This is the team’s first and most important commitment.
- Assign the broken build as the highest-priority work item. Stop feature work if necessary.
- Triage every failure: fix it, quarantine it to a non-blocking suite, or delete the test if it
provides no value.
- Once the build is green, make the team agreement explicit: a red build is the team’s top
priority. Whoever broke it fixes it. If they cannot fix it within 15 minutes, they revert
their change and try again with a smaller commit.
Write this agreement down. Put it in the team’s working agreements document. If you do not have
one, start one now. The agreement is simple: we do not commit on top of a red build, and we do not
leave a red build for someone else to fix.
Step 2: Make the build visible (Week 1)
The build status must be impossible to ignore:
- Display the build dashboard on a large monitor visible to the whole team.
- Configure notifications so that a broken build alerts the team immediately - in the team chat
channel, not in individual email inboxes.
- If the build breaks, the notification should identify the commit and the committer.
Visibility creates accountability. When the whole team can see that the build broke at 2:15 PM
and who broke it, social pressure keeps people attentive. When failures are buried in email
notifications, they are easily ignored.
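A hedged sketch of the notification itself, assuming a chat tool that accepts Slack-style incoming-webhook posts and a CI system that exposes the commit and author as environment variables - the variable names below are placeholders to adapt to your pipeline:

```python
import os
import requests

WEBHOOK_URL = os.environ["TEAM_CHAT_WEBHOOK"]             # assumption: stored as a CI secret
commit = os.environ.get("CI_COMMIT_SHA", "unknown")[:8]   # placeholder variable name
author = os.environ.get("CI_COMMIT_AUTHOR", "unknown")    # placeholder variable name

# Post a short, actionable message to the team channel when the build breaks.
requests.post(
    WEBHOOK_URL,
    json={"text": f"Build broken at {commit} by {author} - fix or revert now."},
    timeout=10,
)
```

Wire it into whatever failure hook your CI system provides so the message lands in the team channel rather than an inbox.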
Step 3: Require integration at least once per day (Week 2)
The “continuous” in continuous integration means at least daily, and ideally multiple times per day.
Set the expectation:
- Every developer integrates their work to trunk at least once per day.
- If a developer has been working on a branch for more than a day without integrating, that is a
problem to discuss at standup.
- Track integration frequency per developer per day. Make it visible alongside the build dashboard.
This will expose problems. Some developers will say their work is not ready to integrate. That is a
decomposition problem - the work is too large. Some will say they cannot integrate because the build
is too slow. That is a pipeline problem. Each problem is worth solving. See
Long-Lived Feature Branches for techniques to break large work
into daily integrations.
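Integration frequency can be read straight from history. A minimal sketch, assuming trunk is `origin/main` and that commits landing on trunk are a reasonable proxy for integrations:

```python
import subprocess
from collections import Counter

log = subprocess.run(
    ["git", "log", "origin/main", "--since=14.days", "--date=short", "--format=%ad %an"],
    capture_output=True, text=True, check=True,
).stdout

# One line per commit on trunk: "YYYY-MM-DD Author Name"
per_day_and_author = Counter(line for line in log.splitlines() if line.strip())
for day_author, count in sorted(per_day_and_author.items()):
    print(f"{count:>3} integrations  {day_author}")
```

If the team integrates via merge or squash commits, counting only merge commits (adding `--merges`) may be a closer proxy.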
Step 4: Make the build fast enough to provide useful feedback (Weeks 2-3)
A build that takes 45 minutes is a build that developers will not wait for. Target under 10
minutes for the primary feedback loop:
- Identify the slowest stages and optimize or parallelize them.
- Move slow integration tests to a secondary pipeline that runs after the fast suite passes.
- Add build caching so that unchanged dependencies are not recompiled on every run.
- Run tests in parallel if they are not already.
The goal is a fast feedback loop: the developer pushes, waits a few minutes, and knows whether
their change works with everything else. If they have to wait 30 minutes, they will context-switch,
and the feedback loop breaks.
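One concrete way to split the fast and slow suites, sketched here with pytest as an example framework. The `slow` marker name is our own convention and would need to be registered in the pytest configuration to avoid warnings.

```python
import pytest

@pytest.mark.slow
def test_full_checkout_journey_against_integration_environment():
    # Stand-in for a slow integration or end-to-end check; it runs in the
    # secondary pipeline, not in the primary feedback loop.
    assert True

def test_order_total_is_sum_of_line_items():
    # Fast, isolated check that stays in the primary feedback loop.
    assert sum([2.0, 3.0]) == 5.0
```

The primary pipeline then runs `pytest -m "not slow"` for fast feedback, and the secondary pipeline runs `pytest -m slow` afterward.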
Step 5: Address the objections (Weeks 3-4)
| Objection | Response |
| --- | --- |
| “The build is too slow to fix every red immediately” | Then the build is too slow, and that is a separate problem to solve. A slow build is not a reason to ignore failures - it is a reason to invest in making the build faster. |
| “Some tests are flaky - we can’t treat every failure as real” | Quarantine flaky tests into a non-blocking suite. The blocking suite must be deterministic. If a test in the blocking suite fails, it is real until proven otherwise. |
| “We can’t integrate daily - our features take weeks” | The features take weeks. The integrations do not have to. Use branch by abstraction, feature flags, or vertical slicing to integrate partial work daily. |
| “Fixing someone else’s broken build is not my job” | It is the whole team’s job. A red build blocks everyone. If the person who broke it is unavailable, someone else should revert or fix it. The team owns the build, not the individual. |
| “We have CI - the build server runs on every push” | A build server is not CI. CI is the practice of integrating frequently and keeping the build green. If the build has been red for a week, you have a build server, not continuous integration. |
Step 6: Build the habit (Week 4+)
Continuous integration is a daily discipline, not a one-time setup. Reinforce the habit:
- Review integration frequency in retrospectives. If it is dropping, ask why.
- Celebrate streaks of consecutive green builds. Make it a point of team pride.
- When a developer reverts a broken commit quickly, recognize it as the right behavior - not as a
failure.
- Periodically audit the build: is it still fast? Are new flaky tests creeping in? Is the test
coverage meaningful?
The goal is a team culture where a red build feels wrong - like an alarm that demands immediate
attention. When that instinct is in place, CI is no longer a process being followed. It is how
the team works.
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Build pass rate | Percentage of builds that pass on first run - should be above 95% |
| Time to fix a broken build | Should be under 15 minutes, with revert as the fallback |
| Integration frequency | At least one integration per developer per day |
| Build duration | Should be under 10 minutes for the primary feedback loop |
| Longest period with a red build | Should be measured in minutes, not hours or days |
| Development cycle time | Should decrease as integration overhead drops and stabilization periods disappear |
Related Content
1.3 - Testing
Anti-patterns in test strategy, test architecture, and quality practices that block continuous delivery.
These anti-patterns affect how teams build confidence that their code is safe to deploy. They
create slow pipelines, flaky feedback, and manual gates that prevent the continuous flow of
changes to production.
1.3.1 - No Test Automation
Zero automated tests. The team has no idea where to start and the codebase was not designed for testability.
Category: Testing & Quality | Quality Impact: Critical
What This Looks Like
The team deploys by manually verifying things work. Someone clicks through the application, checks
a few screens, and declares it good. There is no test suite. No test runner configured. No test
directory in the repository. The CI server, if one exists, builds the code and stops there.
When a developer asks “how do I know if my change broke something?” the answer is either “you
don’t” or “someone from QA will check it.” Bugs discovered in production are treated as inevitable.
Nobody connects the lack of automated tests to the frequency of production incidents because there
is no baseline to compare against.
Common variations:
- Tests exist but are never run. Someone wrote tests a year ago. The test suite is broken and
nobody has fixed it. The tests are checked into the repository but are not part of any pipeline
or workflow.
- Manual test scripts as the safety net. A spreadsheet or wiki page lists hundreds of manual
test cases. Before each release, someone walks through them by hand. The process takes days. It
is the only verification the team has.
- Testing is someone else’s job. Developers write code. A separate QA team tests it days or
weeks later. The feedback loop is so long that developers have moved on to other work by the
time defects are found.
- “The code is too legacy to test.” The team has decided the codebase is untestable.
Functions are thousands of lines long, everything depends on global state, and there are no
seams where test doubles could be inserted. This belief becomes self-fulfilling - nobody tries
because everyone agrees it is impossible.
The telltale sign: when a developer makes a change, the only way to verify it works is to deploy
it and see what happens.
Why This Is a Problem
Without automated tests, every change is a leap of faith. The team has no fast, reliable way to
know whether code works before it reaches users. Every downstream practice that depends on
confidence in the code - continuous integration, automated deployment, frequent releases - is
blocked.
It reduces quality
When there are no automated tests, defects are caught by humans or by users. Humans are slow,
inconsistent, and unable to check everything. A manual tester cannot verify 500 behaviors in an
hour, but an automated suite can. The behaviors that are not checked are the ones that break.
Developers writing code without tests have no feedback on whether their logic is correct until
someone else exercises it. A function that handles an edge case incorrectly will not be caught
until a user hits that edge case in production. By then, the developer has moved on and lost
context on the code they wrote.
With even a basic suite of automated tests, developers get feedback in minutes. They catch their
own mistakes while the code is fresh. The suite runs the same checks every time, never forgetting
an edge case and never getting tired.
It increases rework
Without tests, rework comes from two directions. First, bugs that reach production must be
investigated, diagnosed, and fixed - work that an automated test would have prevented. Second,
developers are afraid to change existing code because they have no way to verify they have not
broken something. This fear leads to workarounds: copy-pasting code instead of refactoring,
adding conditional branches instead of restructuring, and building new modules alongside old ones
instead of modifying what exists.
Over time, the codebase becomes a patchwork of workarounds layered on workarounds. Each change
takes longer because the code is harder to understand and more fragile. The absence of tests is
not just a testing problem - it is a design problem that compounds with every change.
Teams with automated tests refactor confidently. They rename functions, extract modules, and
simplify logic knowing that the test suite will catch regressions. The codebase stays clean
because changing it is safe.
It makes delivery timelines unpredictable
Without automated tests, the time between “code complete” and “deployed” is dominated by manual
verification. How long that verification takes depends on how many changes are in the batch, how
available the testers are, and how many defects they find. None of these variables are predictable.
A change that a developer finishes on Monday might not be verified until Thursday. If defects are
found, the cycle restarts. Lead time from commit to production is measured in weeks, and the
variance is enormous. Some changes take three days, others take three weeks, and the team cannot
predict which.
Automated tests collapse the verification step to minutes. The time from “code complete” to
“verified” becomes a constant, not a variable. Lead time becomes predictable because the largest
source of variance has been removed.
Impact on continuous delivery
Automated tests are the foundation of continuous delivery. Without them, there is no automated
quality gate. Without an automated quality gate, there is no safe way to deploy frequently.
Without frequent deployment, there is no fast feedback from production. Every CD practice assumes
that the team can verify code quality automatically. A team with no test automation is not on a
slow path to CD - they have not started.
How to Fix It
Starting test automation on an untested codebase feels overwhelming. The key is to start small,
establish the habit, and expand coverage incrementally. You do not need to test everything before
you get value - you need to test something and keep going.
Step 1: Set up the test infrastructure (Week 1)
Before writing a single test, make it trivially easy to run tests:
- Choose a test framework for your primary language. Pick the most popular one - do not
deliberate.
- Add the framework to the project. Configure it. Write a single test that asserts
  `true == true` and verify it passes.
- Add a test script or command to the project so that anyone can run the suite with a single
  command (e.g., `npm test`, `pytest`, `mvn test`).
- Add the test command to the CI pipeline so that tests run on every push.
The goal for week one is not coverage. It is infrastructure: a working test runner in the pipeline
that the team can build on.
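Using pytest as an example framework, the whole of week one’s “coverage” can be this single file, committed alongside the pipeline change that runs it:

```python
# tests/test_wiring.py - proves the test runner is installed, discovered,
# and visible in CI. It gets replaced by real tests immediately.
def test_the_test_runner_is_wired_up():
    assert True
```

Running `pytest` locally and seeing the same result in the pipeline is the week-one definition of done.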
Step 2: Write tests for every new change (Week 2+)
Establish a team rule: every new change must include at least one automated test. Not “every new
feature” - every change. Bug fixes get a regression test that fails without the fix and passes
with it. New functions get a test that verifies the core behavior. Refactoring gets a test that
pins the existing behavior before changing it.
This rule is more important than retroactive coverage. New code enters the codebase tested. The
tested portion grows with every commit. After a few months, the most actively changed code has
coverage, which is exactly where coverage matters most.
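A hedged illustration of the rule for bug fixes, with an invented function and an invented bug:

```python
def parse_quantity(raw: str) -> int:
    # Bug fix: the previous version crashed on surrounding whitespace ("  3 ").
    return int(raw.strip())

def test_parse_quantity_accepts_surrounding_whitespace():
    # Fails against the pre-fix code, passes with the fix - and now guards
    # against the bug ever returning.
    assert parse_quantity("  3 ") == 3
```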
Step 3: Target high-change areas for retroactive coverage (Weeks 3-6)
Use your version control history to find the files that change most often. These are the files
where bugs are most likely and where tests provide the most value:
- List the 10 files with the most commits in the last six months (see the sketch after this list).
- For each file, write tests for its core public behavior. Do not try to test every line - test
the functions that other code depends on.
- If the code is hard to test because of tight coupling, wrap it. Create a thin adapter around
the untestable code and test the adapter. This is the Strangler Fig pattern applied to testing.
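The first item can be scripted directly from version control. A minimal sketch, assuming it runs from the repository root:

```python
import subprocess
from collections import Counter

log = subprocess.run(
    ["git", "log", "--since=6.months", "--name-only", "--format="],
    capture_output=True, text=True, check=True,
).stdout

# With an empty --format, the output is just the changed file paths, one per line.
churn = Counter(path for path in log.splitlines() if path.strip())
for path, commits in churn.most_common(10):
    print(f"{commits:>4} commits  {path}")
```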
Step 4: Make untestable code testable incrementally (Weeks 4-8)
If the codebase resists testing, introduce seams one at a time:
| Problem | Technique |
| --- | --- |
| Function does too many things | Extract the pure logic into a separate function and test that |
| Hard-coded database calls | Introduce a repository interface, inject it, test with a fake |
| Global state or singletons | Pass dependencies as parameters instead of accessing globals |
| No dependency injection | Start with “poor man’s DI” - default parameters that can be overridden in tests |
You do not need to refactor the entire codebase. Each time you touch a file, leave it slightly
more testable than you found it.
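Two of the seams above, sketched together: pure logic extracted into its own function, and “poor man’s DI” via a default parameter so a test can substitute a fake. All names are illustrative.

```python
def bulk_discount(total: float) -> float:
    # Pure pricing logic pulled out of a larger method so it can be tested
    # directly, with no database or framework involved.
    return total * 0.9 if total >= 100 else total

class QuoteService:
    def __init__(self, price_rule=bulk_discount):
        # The default parameter keeps production call sites unchanged, while
        # a test can inject a different rule or a recording fake.
        self._price_rule = price_rule

    def quote(self, total: float) -> float:
        return self._price_rule(total)

def test_quote_applies_bulk_discount_at_threshold():
    assert QuoteService().quote(100) == 90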
Step 5: Set a coverage floor and ratchet it up (Week 6+)
Once you have meaningful coverage in actively changed code, set a coverage threshold in the
pipeline:
- Measure current coverage. Say it is 15%.
- Set the pipeline to fail if coverage drops below 15%.
- Every two weeks, raise the floor by 2-5 percentage points.
The floor prevents backsliding. The ratchet ensures progress. The team does not need to hit 90%
coverage - they need to ensure that coverage only goes up.
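If your coverage tool cannot enforce a threshold directly, a small pipeline step can. A minimal sketch, assuming a Cobertura-style `coverage.xml` whose root element carries a `line-rate` attribute (the format coverage.py’s `coverage xml` command emits):

```python
import sys
import xml.etree.ElementTree as ET

FLOOR = 15.0  # current floor in percent; raise it a few points every two weeks

rate = float(ET.parse("coverage.xml").getroot().attrib["line-rate"]) * 100
print(f"line coverage: {rate:.1f}% (floor: {FLOOR:.1f}%)")
sys.exit(0 if rate >= FLOOR else 1)
```

Many coverage tools have an equivalent built-in threshold option; prefer the built-in if yours does.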
| Objection | Response |
| --- | --- |
| “The codebase is too legacy to test” | You do not need to test the legacy code directly. Wrap it in testable adapters and test those. Every new change gets a test. Coverage grows from the edges inward. |
| “We don’t have time to write tests” | You are already spending that time on manual verification and production debugging. Tests shift that cost to the left where it is cheaper. Start with one test per change - the overhead is minutes, not hours. |
| “We need to test everything before it’s useful” | One test that catches one regression is more useful than zero tests. The value is immediate and cumulative. You do not need full coverage to start getting value. |
| “Developers don’t know how to write tests” | Pair a developer who has testing experience with one who does not. If nobody on the team has experience, invest one day in a testing workshop. The skill is learnable in a week. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Test count | Should increase every sprint |
| Code coverage of actively changed files | More meaningful than overall coverage - focus on files changed in the last 30 days |
| Build duration | Should increase slightly as tests are added, but stay under 10 minutes |
| Defects found in production vs. in tests | Ratio should shift toward tests over time |
| Change fail rate | Should decrease as test coverage catches regressions before deployment |
| Manual testing effort per release | Should decrease as automated tests replace manual verification |
Related Content
1.3.2 - Manual Regression Testing Gates
Every release requires days or weeks of manual testing. Testers execute scripted test cases. Test effort scales linearly with application size.
Category: Testing & Quality | Quality Impact: Critical
What This Looks Like
Before every release, the team enters a testing phase. Testers open a spreadsheet or test
management tool containing hundreds of scripted test cases. They walk through each one by hand:
click this button, enter this value, verify this result. The testing takes days. Sometimes it takes
weeks. Nothing ships until every case is marked pass or fail, and every failure is triaged.
Developers stop working on new features during this phase because testers need a stable build to
test against. Code freezes go into effect. Bug fixes discovered during testing must be applied
carefully to avoid invalidating tests that have already passed. The team enters a holding pattern
where the only work that matters is getting through the test cases.
The testing effort grows with every release. New features add new test cases, but old test cases
are rarely removed because nobody is confident they are redundant. A team that tested for three
days six months ago now tests for five. The spreadsheet has 800 rows. Every release takes longer
to validate than the last.
Common variations:
- The regression spreadsheet. A master spreadsheet of every test case the team has ever
written. Before each release, a tester works through every row. The spreadsheet is the
institutional memory of what the software is supposed to do, and nobody trusts anything else.
- The dedicated test phase. The sprint cadence is two weeks of development followed by one week
of testing. The test week is a mini-waterfall phase embedded in an otherwise agile process.
Nothing can ship until the test phase is complete.
- The test environment bottleneck. Manual testing requires a specific environment that is shared
across teams. The team must wait for their slot. When the environment is broken by another team’s
testing, everyone waits for it to be restored.
- The sign-off ceremony. A QA lead or manager must personally verify a subset of critical paths
and sign a document before the release can proceed. If that person is on vacation, the release
waits.
- The compliance-driven test cycle. Regulatory requirements are interpreted as requiring manual
execution of every test case with documented evidence. Each test run produces screenshots and
sign-off forms. The documentation takes as long as the testing itself.
The telltale sign: if the question “can we release today?” is always answered with “not until QA
finishes,” manual regression testing is gating your delivery.
Why This Is a Problem
Manual regression testing feels responsible. It feels thorough. But it creates a bottleneck that
grows worse with every feature the team builds, and the thoroughness it promises is an illusion.
It reduces quality
Manual testing is less reliable than it appears. A human executing the same test case for the
hundredth time will miss things. Attention drifts. Steps get skipped. Edge cases that seemed
important when the test was written get glossed over when the tester is on row 600 of a
spreadsheet. Studies on manual testing consistently show that testers miss 15-30% of defects
that are present in the software they are testing.
The test cases themselves decay. They were written for the version of the software that existed
when the feature shipped. As the product evolves, some cases become irrelevant, others become
incomplete, and nobody updates them systematically. The team is executing a test plan that
partially describes software that no longer exists.
The feedback delay compounds the quality problem. A developer who wrote code two weeks ago gets
a bug report from a tester during the regression cycle. The developer has lost context on the
change. They re-read their own code, try to remember what they were thinking, and fix the bug
with less confidence than they would have had the day they wrote it.
Automated tests catch the same classes of bugs in seconds, with perfect consistency, every time
the code changes. They do not get tired on row 600. They do not skip steps. They run against the
current version of the software, not a test plan written six months ago. And they give feedback
immediately, while the developer still has full context.
It increases rework
The manual testing gate creates a batch-and-queue cycle. Developers write code for two weeks, then
testers spend a week finding bugs in that code. Every bug found during the regression cycle is
rework: the developer must stop what they are doing, reload the context of a completed story,
diagnose the issue, fix it, and send it back to the tester for re-verification. The re-verification
may invalidate other test cases, requiring additional re-testing.
The batch size amplifies the rework. When two weeks of changes are tested together, a bug could be
in any of dozens of commits. Narrowing down the cause takes longer because there are more
variables. When the same bug would have been caught by an automated test minutes after it was
introduced, the developer would have fixed it in the same sitting - one context switch instead of
many.
The rework also affects testers. A bug fix during the regression cycle means the tester must re-run
affected test cases. If the fix changes behavior elsewhere, the tester must re-run those cases too.
A single bug fix can cascade into hours of re-testing, pushing the release date further out.
With automated regression tests, bugs are caught as they are introduced. The fix happens
immediately. There is no regression cycle, no re-testing cascade, and no context-switching penalty.
It makes delivery timelines unpredictable
The regression testing phase takes as long as it takes. The team cannot predict how many bugs the
testers will find, how long each fix will take, or how much re-testing the fixes will require. A
release planned for Friday might slip to the following Wednesday. Or the following Friday.
This unpredictability cascades through the organization. Product managers cannot commit to delivery
dates because they do not know how long testing will take. Stakeholders learn to pad their
expectations. “We’ll release in two weeks” really means “we’ll release in two to four weeks,
depending on what QA finds.”
The unpredictability also creates pressure to cut corners. When the release is already three days
late, the team faces a choice: re-test thoroughly after a late bug fix, or ship without full
re-testing. Under deadline pressure, most teams choose the latter. The manual testing gate that
was supposed to ensure quality becomes the reason quality is compromised.
Automated regression suites produce predictable, repeatable results. The suite runs in the same
amount of time every time. There is no testing phase to slip. The team knows within minutes of
every commit whether the software is releasable.
It creates a permanent scaling problem
Manual testing effort scales linearly with application size. Every new feature adds test cases.
The test suite never shrinks. A team that takes three days to test today will take four days in
six months and five days in a year. The testing phase consumes an ever-growing fraction of the
team’s capacity.
This scaling problem is invisible at first. Three days of testing feels manageable. But the growth
is relentless. The team that started with 200 test cases now has 800. The test phase that was two
days is now a week. And because the test cases were written by different people at different times,
nobody can confidently remove any of them without risking a missed regression.
Automated tests scale differently. Adding a new automated test adds milliseconds to the suite
duration, not hours to the testing phase. A team with 10,000 automated tests runs them in the same
10 minutes as a team with 1,000. The cost of confidence is fixed, not linear.
Impact on continuous delivery
Manual regression testing is fundamentally incompatible with continuous delivery. CD requires that
any commit can be released at any time. A manual testing gate that takes days means the team can
release at most once per testing cycle. If the gate takes a week, the team releases at most every
two or three weeks - regardless of how fast their pipeline is or how small their changes are.
The manual gate also breaks the feedback loop that CD depends on. CD gives developers confidence
that their change works by running automated checks within minutes. A manual gate replaces that
fast feedback with a slow, batched, human process that cannot keep up with the pace of development.
You cannot have continuous delivery with a manual regression gate. The two are mutually exclusive.
The gate must be automated before CD is possible.
How to Fix It
Step 1: Catalog your manual test cases and categorize them (Week 1)
Before automating anything, understand what the manual test suite actually covers. For every test
case in the regression suite:
- Identify what behavior it verifies.
- Classify it: is it testing business logic, a UI flow, an integration boundary, or a compliance
requirement?
- Rate its value: has this test ever caught a real bug? When was the last time?
- Rate its automation potential: can this be tested at a lower level (unit, functional, API)?
Most teams discover that a large percentage of their manual test cases are either redundant (the
same behavior is tested multiple times), outdated (the feature has changed), or automatable at a
lower level.
Step 2: Automate the highest-value cases first (Weeks 2-4)
Pick the 20 test cases that cover the most critical paths - the ones that would cause the most
damage if they regressed. Automate them:
- Business logic tests become unit tests.
- API behavior tests become functional tests.
- Critical user journeys become a small set of E2E smoke tests.
Do not try to automate everything at once. Start with the cases that give the most confidence per
minute of execution time. The goal is to build a fast automated suite that covers the riskiest
scenarios so the team no longer depends on manual execution for those paths.
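As an illustration, a scripted manual case such as “log in, open the orders page, verify orders are listed” might become an API-level functional test. The sketch below uses the `requests` library against placeholder endpoints; the URLs, payloads, and credentials are assumptions to adapt to your own system.

```python
import requests

BASE_URL = "https://staging.example.com"  # placeholder test environment

def test_orders_endpoint_lists_orders_for_a_known_user():
    session = requests.Session()

    # Step 1 of the manual script: log in as a known test user.
    login = session.post(
        f"{BASE_URL}/api/login",
        json={"user": "test-user", "password": "test-pass"},
        timeout=10,
    )
    assert login.status_code == 200

    # Step 2: the orders page's data source returns a list of orders.
    orders = session.get(f"{BASE_URL}/api/orders", timeout=10)
    assert orders.status_code == 200
    assert isinstance(orders.json(), list)
```

The same check that took a tester several minutes to click through now runs in seconds on every commit.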
Step 3: Run automated tests in the pipeline on every commit (Week 3)
Move the new automated tests into the CI pipeline so they run on every push. This is the critical
shift: testing moves from a phase at the end of development to a continuous activity that happens
with every change.
Every commit now gets immediate feedback on the critical paths. If a regression is introduced, the
developer knows within minutes - not weeks.
Step 4: Shrink the manual suite as automation grows (Weeks 4-8)
Each week, pick another batch of manual test cases and either automate or retire them:
- Automate cases where the behavior is stable and testable at a lower level.
- Retire cases that are redundant with existing automated tests or that test behavior that no
longer exists.
- Keep manual only for genuinely exploratory testing that requires human judgment - usability
evaluation, visual design review, or complex workflows that resist automation.
Track the shrinkage. If the manual suite had 800 cases and now has 400, that is progress. If the
manual testing phase took five days and now takes two, that is measurable improvement.
Step 5: Replace the testing phase with continuous testing (Weeks 6-8+)
The goal is to eliminate the dedicated testing phase entirely:
| Before | After |
| --- | --- |
| Code freeze before testing | No code freeze - trunk is always testable |
| Testers execute scripted cases | Automated suite runs on every commit |
| Bugs found days or weeks after coding | Bugs found minutes after coding |
| Testing phase blocks release | Release readiness checked automatically |
| QA sign-off required | Pipeline pass is the sign-off |
| Testers do manual regression | Testers do exploratory testing, write automated tests, and improve test infrastructure |
Step 6: Address the objections (Ongoing)
| Objection | Response |
| --- | --- |
| “Automated tests can’t catch everything a human can” | Correct. But humans cannot execute 800 test cases reliably in a day, and automated tests can. Automate the repeatable checks and free humans for the exploratory testing where their judgment adds value. |
| “We need manual testing for compliance” | Most compliance frameworks require evidence that testing was performed, not that humans performed it. Automated test reports with pass/fail results, timestamps, and traceability to requirements satisfy most audit requirements better than manual spreadsheets. Confirm with your compliance team. |
| “Our testers don’t know how to write automated tests” | Pair testers with developers. The tester contributes domain knowledge - what to test and why - while the developer contributes automation skills. Over time, the tester learns automation and the developer learns testing strategy. |
| “We can’t automate tests for our legacy system” | Start with new code. Every new feature gets automated tests. For legacy code, automate the most critical paths first and expand coverage as you touch each area. The legacy system does not need 100% automation overnight. |
| “What if we automate a test wrong and miss a real bug?” | Manual tests miss real bugs too - consistently. An automated test that is wrong can be fixed once and stays fixed. A manual tester who skips a step makes the same mistake next time. Automation is not perfect, but it is more reliable and more improvable than manual execution. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Manual test case count | Should decrease steadily as cases are automated or retired |
| Manual testing phase duration | Should shrink toward zero |
| Automated test count in pipeline | Should grow as manual cases are converted |
| Release frequency | Should increase as the manual gate shrinks |
| Development cycle time | Should decrease as the testing phase is eliminated |
| Time from code complete to release | Should converge toward pipeline duration, not testing phase duration |
Related Content
1.3.3 - Flaky Test Suites
Tests randomly pass or fail. Developers rerun the pipeline until it goes green. Nobody trusts the test suite to tell them anything useful.
Category: Testing & Quality | Quality Impact: High
What This Looks Like
A developer pushes a change. The pipeline fails. They look at the failure - it is a test they did
not touch, in a module they did not change. They click “rerun.” It passes. They merge.
This happens multiple times a day across the team. Nobody investigates failures on the first
occurrence because the odds favor flakiness over a real problem. When someone mentions a test
failure in standup, the first question is “did you rerun it?” not “what broke?”
Common variations:
- The nightly lottery. The full suite runs overnight. Every morning, a different random subset
of tests is red. Someone triages the failures, marks most as flaky, and the team moves on. Real
regressions hide in the noise.
- The retry-until-green pattern. The pipeline configuration automatically reruns failed tests
two or three times. If a test passes on any attempt, it counts as passed. The team considers
this solved. In reality, it masks failures and doubles or triples pipeline duration.
- The “known flaky” tag. Tests are annotated with a skip or known-flaky marker. The suite
ignores them. The list grows over time. Nobody goes back to fix them because they are out of
sight.
- Environment-dependent failures. Tests pass on developer machines but fail in CI, or pass in
CI but fail on Tuesdays. The failures correlate with shared test environments, time-of-day
load patterns, or external service availability.
- Test order dependency. Tests pass when run in a specific order but fail when run in
isolation or in a different sequence. Shared mutable state from one test leaks into another.
The telltale sign: the team has a shared understanding that the first pipeline failure “doesn’t
count.” Rerunning the pipeline is a routine step, not an exception.
Why This Is a Problem
Flaky tests are not a minor annoyance. They systematically destroy the value of the test suite by
making it impossible to distinguish signal from noise. A test suite that sometimes lies is worse
than no test suite at all, because it creates an illusion of safety.
It reduces quality
When tests fail randomly, developers stop trusting them. The rational response to a flaky suite
is to ignore failures - and that is exactly what happens. A developer whose pipeline fails three
times a week for reasons unrelated to their code learns to click “rerun” without reading the
error message.
This behavior is invisible most of the time. It becomes catastrophic when a real regression
happens. The test that catches the regression fails, the developer reruns because “it’s probably
flaky,” it passes on the second run because the flaky behavior went the other way, and the
regression ships to production. The test did its job, but the developer’s trained behavior
neutralized it.
In a suite with zero flaky tests, every failure demands investigation. Developers read the error,
find the cause, and fix it. Failures are rare and meaningful. The suite functions as a reliable
quality gate.
It increases rework
Flaky tests cause rework in two ways. First, developers spend time investigating failures that
turn out to be noise. A developer sees a test failure, spends 20 minutes reading the error and
reviewing their change, realizes the failure is unrelated, and reruns. Multiply this by every
developer on the team, multiple times per day.
Second, the retry-until-green pattern extends pipeline duration. A pipeline that should take 8
minutes takes 20 because failed tests are rerun automatically. Developers wait longer for
feedback, switch contexts more often, and lose time to every switch.
Teams with deterministic test suites waste zero time investigating flaky failures. Their pipeline
runs once, gives an answer, and the developer acts on it.
It makes delivery timelines unpredictable
A flaky suite introduces randomness into the delivery process. The same code, submitted twice,
might pass the pipeline on the first attempt or take three reruns. Lead time from commit to merge
varies not because of code quality but because of test noise.
When the team needs to ship urgently, flaky tests become a source of anxiety. “Will the pipeline
pass this time?” The team starts planning around the flakiness - running the pipeline early “in
case it fails,” avoiding changes late in the day because there might not be time for reruns. The
delivery process is shaped by the unreliability of the tests rather than by the quality of the
code.
Deterministic tests make delivery time a function of code quality alone. The pipeline is a
predictable step that takes the same amount of time every run. There are no surprises.
It normalizes ignoring failures
The most damaging effect of flaky tests is cultural. Once a team accepts that test failures are
often noise, the standard for investigating failures drops permanently. New team members learn
from day one that “you just rerun it.” The bar for adding a flaky test to the suite is low
because one more flaky test is barely noticeable when there are already dozens.
This normalization extends beyond tests. If the team tolerates unreliable automated checks, they
will tolerate unreliable monitoring, unreliable alerts, and unreliable deploys. Flaky tests teach
the team that automation is not trustworthy - exactly the opposite of what CD requires.
Impact on continuous delivery
Continuous delivery depends on automated quality gates that the team trusts completely. A flaky
suite is a quality gate with a broken lock - it looks like it is there, but it does not actually
stop anything. Developers bypass it by rerunning. Regressions pass through it by luck.
The pipeline must be a machine that answers one question with certainty: “Is this change safe to
deploy?” A flaky suite answers “probably, maybe, rerun and ask again.” That is not a foundation
you can build continuous delivery on.
How to Fix It
Step 1: Measure the flakiness (Week 1)
Before fixing anything, quantify the problem:
- Collect pipeline run data for the last 30 days. Count the number of runs that failed and were
rerun without code changes.
- Identify which specific tests failed across those reruns. Rank them by failure frequency.
- Calculate the pipeline pass rate: what percentage of first-attempt runs succeed?
This gives you a hit list and a baseline. If your first-attempt pass rate is 60%, you know 40% of
pipeline runs are wasted on flaky failures.
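A minimal sketch of this measurement, assuming you can export pipeline runs as records with a
commit id, an attempt number, and a result (the field names and data shape are illustrative, not
any particular CI tool's API):

```python
from collections import defaultdict

# Illustrative export of the last 30 days of pipeline runs:
# (commit_id, attempt_number, passed)
runs = [
    ("a1b2c3", 1, False),
    ("a1b2c3", 2, True),   # rerun with no code change
    ("d4e5f6", 1, True),
]

def first_attempt_pass_rate(runs):
    first_attempts = [r for r in runs if r[1] == 1]
    return sum(1 for r in first_attempts if r[2]) / len(first_attempts)

def commits_needing_reruns(runs):
    max_attempts = defaultdict(int)
    for commit, attempt, _ in runs:
        max_attempts[commit] = max(max_attempts[commit], attempt)
    return sum(1 for n in max_attempts.values() if n > 1)

print(f"First-attempt pass rate: {first_attempt_pass_rate(runs):.0%}")
print(f"Commits that needed reruns: {commits_needing_reruns(runs)}")
```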
Step 2: Quarantine the worst offenders (Week 1)
Take the top 10 flakiest tests and move them out of the pipeline-gating suite immediately. Do not
fix them yet - just remove them from the critical path.
- Move them to a separate test suite that runs on a schedule (nightly or hourly) but does not
block merges.
- Create a tracking issue for each quarantined test with its failure rate and the suspected cause.
This immediately improves pipeline reliability. The team sees fewer false failures on day one.
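If the suite runs on pytest, one way to implement the quarantine is a marker plus a small
collection hook; the marker name and the environment variable are assumptions, not a standard:

```python
# conftest.py - skip quarantined tests in the gating run, keep them in a
# scheduled job that sets RUN_QUARANTINE=1. Tag tests with
# @pytest.mark.quarantine and register the marker in pytest.ini.
import os
import pytest

def pytest_collection_modifyitems(config, items):
    if os.environ.get("RUN_QUARANTINE") == "1":
        return  # scheduled job: run everything, including quarantined tests
    skip = pytest.mark.skip(reason="quarantined: tracked as a flaky-test issue")
    for item in items:
        if "quarantine" in item.keywords:
            item.add_marker(skip)
```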
Step 3: Fix or replace quarantined tests (Weeks 2-4)
Work through the quarantined tests systematically. For each one, identify the root cause:
| Root cause | Fix |
| --- | --- |
| Shared mutable state (database, filesystem, cache) | Isolate test data. Each test creates and destroys its own state. Use transactions or test containers. |
| Timing dependencies (sleep, setTimeout, polling) | Replace time-based waits with event-based waits. Wait for a condition, not a duration. |
| Test order dependency | Ensure each test is self-contained. Run tests in random order to surface hidden dependencies. |
| External service dependency | Replace with a test double. Validate the double with a contract test. |
| Race conditions in async code | Use deterministic test patterns. Await promises. Avoid fire-and-forget in test code. |
| Resource contention (ports, files, shared environments) | Allocate unique resources per test. Use random ports. Use temp directories. |
For each quarantined test, either fix it and return it to the gating suite or replace it with a
deterministic lower-level test that covers the same behavior.
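For timing-dependent tests, the fix usually looks the same regardless of framework: wait for the
condition, bounded by a timeout, instead of sleeping for a fixed duration. A minimal helper, with
the example calls as placeholders:

```python
import time

def wait_for(condition, timeout=5.0, interval=0.05):
    """Poll a condition until it is truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(interval)
    raise TimeoutError("condition not met within timeout")

# Before (flaky and slow):      time.sleep(2); assert job.is_done()
# After (deterministic intent): wait_for(lambda: job.is_done())
```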
Step 4: Prevent new flaky tests from entering the suite (Week 3+)
Establish guardrails so the problem does not recur:
- Run new tests 10 times in CI before merging them. If any run fails, the test is flaky and must
be fixed before it enters the suite. A sketch of this check follows the list.
- Run the full suite in random order. This surfaces order-dependent tests immediately.
- Track the pipeline first-attempt pass rate as a team metric. Make it visible on a dashboard.
Set a target (e.g., 95%) and treat drops below the target as incidents.
- Add a team working agreement: flaky tests are treated as bugs with the same priority as
production defects.
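The repeat-run check for new tests can start as a short script in CI; which files count as "new"
is left to your pipeline, and the pytest invocation is an assumption about your test runner:

```python
# check_new_tests.py - run candidate test files repeatedly; any failure = flaky.
# Usage: python check_new_tests.py tests/test_new_feature.py
import subprocess
import sys

RUNS = 10

def main(paths):
    for i in range(1, RUNS + 1):
        result = subprocess.run(["pytest", "-q", *paths])
        if result.returncode != 0:
            print(f"Run {i}/{RUNS} failed: the test is flaky, fix it before merging.")
            sys.exit(1)
    print(f"All {RUNS} runs passed.")

if __name__ == "__main__":
    main(sys.argv[1:])
```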
Step 5: Eliminate automatic retries (Week 4+)
If the pipeline is configured to automatically retry failed tests, turn it off. Retries mask
flakiness instead of surfacing it. Once the quarantine and prevention steps are in place, the
suite should be reliable enough to run once.
If a test fails, it should mean something. Retries teach the team that failures are meaningless.
Expect these objections:
| Objection | Response |
| --- | --- |
| “Retries are fine - they handle transient issues” | Transient issues in a test suite are a symptom of external dependencies or shared state. Fix the root cause instead of papering over it with retries. |
| “We don’t have time to fix flaky tests” | Calculate the time the team spends rerunning pipelines and investigating false failures. It is almost always more than the time to fix the flaky tests. |
| “Some flakiness is inevitable with E2E tests” | That is an argument for fewer E2E tests, not for tolerating flakiness. Push the test down to a level where it can be deterministic. |
| “The flaky test sometimes catches real bugs” | A test that catches real bugs 5% of the time and false-alarms 20% of the time is a net negative. Replace it with a deterministic test that catches the same bugs 100% of the time. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Pipeline first-attempt pass rate | Should climb toward 95%+ |
| Number of quarantined tests | Should decrease to zero as tests are fixed or replaced |
| Pipeline reruns per week | Should drop to near zero |
| Build duration | Should decrease as retries are removed |
| Development cycle time | Should decrease as developers stop waiting for reruns |
| Developer trust survey | Ask quarterly: “Do you trust the test suite to catch real problems?” Answers should improve. |
Related Content
1.3.4 - Inverted Test Pyramid
Most tests are slow end-to-end or UI tests. Few unit tests. The test suite is slow, brittle, and expensive to maintain.
Category: Testing & Quality | Quality Impact: High
What This Looks Like
The team has tests, but the wrong kind. Running the full suite takes 30 minutes or more. Tests
fail randomly. Developers rerun the pipeline and hope for green. When a test fails, the first
question is “is that a real failure or a flaky test?” rather than “what did I break?”
Common variations:
- The ice cream cone. Most testing is manual. Below that, a large suite of end-to-end browser
tests. A handful of integration tests. Almost no unit tests. The manual testing takes days, the
E2E suite takes hours, and nothing runs fast enough to give developers feedback while they code.
- The E2E-first approach. The team believes end-to-end tests are “real” tests because they
test the “whole system.” Unit tests are dismissed as “not testing anything useful” because they
use mocks. The result is a suite of 500 Selenium tests that take 45 minutes and fail 10% of
the time.
- The integration test swamp. Every test boots a real database, calls real services, and
depends on shared test environments. Tests are slow because they set up and tear down heavy
infrastructure. They are flaky because they depend on network availability and shared mutable
state.
- The UI test obsession. The team writes tests exclusively through the UI layer. Business
logic that could be verified in milliseconds with a unit test is instead tested through a
full browser automation flow that takes seconds per assertion.
- The “we have coverage” illusion. Code coverage is high because the E2E tests exercise most
code paths. But the tests are so slow and brittle that developers do not run them locally. They
push code and wait 40 minutes to learn if it works. If a test fails, they assume it is flaky
and rerun.
The telltale sign: developers do not trust the test suite. They push code and go get coffee. When
tests fail, they rerun before investigating. When a test is red for days, nobody is alarmed.
Why This Is a Problem
An inverted test pyramid does not just slow the team down. It actively undermines every benefit
that testing is supposed to provide.
The suite is too slow to give useful feedback
The purpose of a test suite is to tell developers whether their change works - fast enough that
they can act on the feedback while they still have context. A suite that runs in seconds gives
feedback during development. A suite that runs in minutes gives feedback before the developer
moves on. A suite that runs in 30 or more minutes gives feedback after the developer has started
something else entirely.
When the suite takes 40 minutes, developers do not run it locally. They push to CI and
context-switch to a different task. When the result comes back, they have lost the mental model of the
code they changed. Investigating a failure takes longer because they have to re-read their own
code. Fixing the failure takes longer because they are now juggling two streams of work.
A well-structured suite - heavy on unit tests, light on E2E - runs in under 10 minutes. Developers
run it locally before pushing. Failures are caught while the code is still fresh. The feedback
loop is tight enough to support continuous integration.
Flaky tests destroy trust
End-to-end tests are inherently non-deterministic. They depend on network connectivity, shared
test environments, external service availability, browser rendering timing, and dozens of other
factors outside the developer’s control. A test that fails because a third-party API was slow for
200 milliseconds looks identical to a test that fails because the code is wrong.
When 10% of the suite fails randomly on any given run, developers learn to ignore failures. They
rerun the pipeline, and if it passes the second time, they assume the first failure was noise.
This behavior is rational given the incentives, but it is catastrophic for quality. Real failures
hide behind the noise. A test that detects a genuine regression gets rerun and ignored alongside
the flaky tests.
Unit tests and functional tests with test doubles are deterministic. They produce the same result
every time. When a deterministic test fails, the developer knows with certainty that they broke
something. There is no rerun. There is no “is that real?” The failure demands investigation.
Maintenance cost grows faster than value
End-to-end tests are expensive to write and expensive to maintain. A single E2E test typically
involves:
- Setting up test data across multiple services
- Navigating through UI flows with waits and retries
- Asserting on UI elements that change with every redesign
- Handling timeouts, race conditions, and flaky selectors
When a feature changes, every E2E test that touches that feature must be updated. A redesign of
the checkout page breaks 30 E2E tests even if the underlying behavior has not changed. The team
spends more time maintaining E2E tests than writing new features.
Unit tests are cheap to write and cheap to maintain. They test behavior, not UI layout. A function
that calculates a discount does not care whether the button is blue or green. When the discount
logic changes, one or two unit tests need updating - not thirty browser flows.
It couples your pipeline to external systems
When most of your tests are end-to-end or integration tests that hit real services, your ability
to deploy depends on every system in the chain being available and healthy. If the payment
provider’s sandbox is down, your pipeline fails. If the shared staging database is slow, your
tests time out. If another team deployed a breaking change to a shared service, your tests fail
even though your code is correct.
This is the opposite of what CD requires. Continuous delivery demands that your team can deploy
independently, at any time, regardless of the state of external systems. A test architecture
built on E2E tests makes your deployment hostage to every dependency in your ecosystem.
A suite built on unit tests, functional tests, and contract tests runs entirely within your
control. External dependencies are replaced with test doubles that are validated by contract
tests. Your pipeline can tell you “this change is safe to deploy” even if every external system
is offline.
Impact on continuous delivery
The inverted pyramid makes CD impossible in practice even if all the other pieces are in place.
The pipeline takes too long to support frequent integration. Flaky failures erode trust in the
automated quality gates. Developers bypass the tests or batch up changes to avoid the wait. The
team gravitates toward manual verification before deploying because they do not trust the
automated suite.
A team that deploys weekly with a 40-minute flaky suite cannot deploy daily without either fixing
the test architecture or abandoning automated quality gates. Neither option is acceptable.
Fixing the architecture is the only sustainable path.
How to Fix It
Inverting the pyramid does not mean deleting all your E2E tests and writing unit tests from
scratch. It means shifting the balance deliberately over time so that most confidence comes from
fast, deterministic tests and only a small amount comes from slow, non-deterministic ones.
Step 1: Audit your current test distribution (Week 1)
Count your tests by type and measure their characteristics:
| Test type | Count | Total duration | Flaky? | Requires external systems? |
| --- | --- | --- | --- | --- |
| Unit | ? | ? | ? | ? |
| Integration | ? | ? | ? | ? |
| Functional | ? | ? | ? | ? |
| E2E | ? | ? | ? | ? |
| Manual | ? | N/A | N/A | N/A |
Run the full suite three times. Note which tests fail intermittently. Record the total duration.
This is your baseline.
Step 2: Quarantine the flaky tests (Week 1)
Move every flaky test out of the pipeline-gating suite into a separate quarantine suite. This is
not deleting them - it is removing them from the critical path so that real failures are visible.
For each quarantined test, decide:
- Fix it if the behavior it tests is important and the flakiness has a solvable cause (timing
dependency, shared state, test order dependency).
- Replace it with a faster, deterministic test that covers the same behavior at a lower level.
- Delete it if the behavior is already covered by other tests or is not worth the maintenance
cost.
Target: zero flaky tests in the pipeline-gating suite by end of week.
Step 3: Push tests down the pyramid (Weeks 2-4)
For each E2E test in your suite, ask: “Can the behavior this test verifies be tested at a lower
level?”
Most of the time, the answer is yes. An E2E test that verifies “user can apply a discount code”
is actually testing three things:
- The discount calculation logic (testable with a unit test)
- The API endpoint that accepts the code (testable with a functional test)
- The UI flow for entering the code (testable with a component test)
Write the lower-level tests first. Once they exist and pass, the E2E test is redundant for gating
purposes. Move it to a post-deployment smoke suite or delete it.
Work through your E2E suite systematically, starting with the slowest and most flaky tests. Each
test you push down the pyramid makes the suite faster and more reliable.
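As an illustration of the first item in the discount example - the names and rules here are
invented, your domain code will differ - the calculation logic can be verified in milliseconds
with no browser and no network:

```python
# Hypothetical discount logic and its unit tests.
def apply_discount(total_cents: int, code: str) -> int:
    """Return the total after applying a discount code, in cents."""
    discounts = {"SAVE10": 0.10, "SAVE25": 0.25}
    rate = discounts.get(code, 0.0)
    return round(total_cents * (1 - rate))

def test_valid_code_reduces_total():
    assert apply_discount(10_000, "SAVE10") == 9_000

def test_unknown_code_is_ignored():
    assert apply_discount(10_000, "BOGUS") == 10_000
```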
Step 4: Replace external dependencies with test doubles (Weeks 2-4)
Identify every test that calls a real external service and replace the dependency:
| Dependency type | Test double approach |
| --- | --- |
| Database | In-memory database, testcontainers, or repository fakes |
| External HTTP API | HTTP stubs (WireMock, nock, MSW) |
| Message queue | In-memory fake or test spy |
| File storage | In-memory filesystem or temp directory |
| Third-party service | Stub that returns canned responses |
Validate your test doubles with contract tests that run asynchronously. This ensures your doubles
stay accurate without coupling your pipeline to external systems.
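One framework-agnostic way to do this is to depend on an interface and inject a stub in tests;
the payment example below is hypothetical, and the contract test that keeps the stub honest runs
in a separate, non-gating job:

```python
# Production code depends on an abstraction; tests inject a deterministic stub.
class PaymentClient:
    def charge(self, amount_cents: int, token: str) -> dict:
        raise NotImplementedError

class StubPaymentClient(PaymentClient):
    """Test double: canned responses, no network, same result every run."""
    def charge(self, amount_cents: int, token: str) -> dict:
        return {"status": "approved", "amount": amount_cents}

def checkout(client: PaymentClient, amount_cents: int, token: str) -> str:
    response = client.charge(amount_cents, token)
    return "ok" if response["status"] == "approved" else "declined"

def test_checkout_approves_payment():
    assert checkout(StubPaymentClient(), 4_200, "tok_test") == "ok"
```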
Step 5: Adopt the test-for-every-change rule (Ongoing)
New code should be tested at the lowest possible level. Establish the team norm:
- Every new function with logic gets a unit test.
- Every new API endpoint or integration boundary gets a functional test.
- E2E tests are only added for critical smoke paths - not for every feature.
- Every bug fix includes a regression test at the lowest level that catches the bug.
Over time, this rule shifts the pyramid naturally. New code enters the codebase with the right
test distribution even as the team works through the legacy E2E suite.
Step 6: Address the objections
| Objection | Response |
| --- | --- |
| “Unit tests with mocks don’t test anything real” | They test logic, which is where most bugs live. A discount calculation that returns the wrong number is a real bug whether it is caught by a unit test or an E2E test. The unit test catches it in milliseconds. The E2E test catches it in minutes - if it is not flaky that day. |
| “E2E tests catch integration bugs that unit tests miss” | Functional tests with test doubles catch most integration bugs. Contract tests catch the rest. The small number of integration bugs that only E2E can find do not justify a suite of hundreds of slow, flaky E2E tests. |
| “We can’t delete E2E tests - they’re our safety net” | They are a safety net with holes. Flaky tests miss real failures. Slow tests delay feedback. Replace them with faster, deterministic tests that actually catch bugs reliably, then keep a small E2E smoke suite for post-deployment verification. |
| “Our code is too tightly coupled to unit test” | That is an architecture problem, not a testing problem. Start by writing tests for new code and refactoring existing code as you touch it. Use the Strangler Fig pattern - wrap untestable code in a testable layer. |
| “We don’t have time to rewrite the test suite” | You are already paying the cost of the inverted pyramid in slow feedback, flaky builds, and manual verification. The fix is incremental: push one test down the pyramid each day. After a month, the suite is measurably faster and more reliable. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Test suite duration | Should decrease toward under 10 minutes |
| Flaky test count in gating suite | Should reach and stay at zero |
| Test distribution (unit : integration : E2E ratio) | Unit tests should be the largest category |
| Pipeline pass rate | Should increase as flaky tests are removed |
| Developers running tests locally | Should increase as the suite gets faster |
| External dependencies in gating tests | Should reach zero |
Related Content
1.4 - Pipeline and Infrastructure
Anti-patterns in build pipelines, deployment automation, and infrastructure management that block continuous delivery.
These anti-patterns affect the automated path from commit to production. They create manual steps,
slow feedback, and fragile deployments that prevent the reliable, repeatable delivery that
continuous delivery requires.
1.4.1 - No Pipeline Exists
Builds and deployments are manual processes. Someone runs a script on their laptop. There is no automated path from commit to production.
Category: Pipeline & Infrastructure | Quality Impact: Critical
What This Looks Like
Deploying to production requires a person. Someone opens a terminal, SSHs into a server, pulls the
latest code, runs a build command, and restarts a service. Or they download an artifact from a
shared drive, copy it to the right server, and run an install script. The steps live in a wiki page,
a shared document, or in someone’s head. Every deployment is a manual operation performed by
whoever knows the procedure.
There is no automation connecting a code commit to a running system. A developer finishes a feature,
pushes to the repository, and then a separate human process begins: someone must decide it is time
to deploy, gather the right artifacts, prepare the target environment, execute the deployment, and
verify that it worked. Each of these steps involves manual effort and human judgment.
The deployment procedure is a craft. Certain people are known for being “good at deploys.” New team
members are warned not to attempt deployments alone. When the person who knows the procedure is
unavailable, deployments wait. The team has learned to treat deployment as a risky, specialized
activity that requires care and experience.
Common variations:
- The deploy script on someone’s laptop. A shell script that automates some steps, but it lives
on one developer’s machine. Nobody else has it. When that developer is out, the team either waits
or reverse-engineers the procedure from the wiki.
- The manual checklist. A document with 30 steps: “SSH into server X, run this command, check
this log file, restart this service.” The checklist is usually out of date. Steps are missing or
in the wrong order. The person deploying adds corrections in the margins.
- The “only Dave can deploy” pattern. One person has the credentials, the knowledge, and the
muscle memory to deploy reliably. Deployments are scheduled around Dave’s availability. Dave
is a single point of failure and cannot take vacation during release weeks.
- The FTP deployment. Build artifacts are uploaded to a server via FTP, SCP, or a file share.
The person deploying must know which files go where, which config files to update, and which
services to restart. A missed file means a broken deployment.
- The manual build. There is no automated build at all. A developer runs the build command
locally, checks that it compiles, and copies the output to the deployment target. The build
that was tested is not necessarily the build that gets deployed.
The telltale sign: if deploying requires a specific person, a specific machine, or a specific
document that must be followed step by step, no pipeline exists.
Why This Is a Problem
The absence of a pipeline means every deployment is a unique event. No two deployments are
identical because human hands are involved in every step. This creates risk, waste, and
unpredictability that compound with every release.
It reduces quality
Without a pipeline, there is no enforced quality gate between a developer’s commit and production.
Tests may or may not be run before deploying. Static analysis may or may not be checked. The
artifact that reaches production may or may not be the same artifact that was tested. Every “may
or may not” is a gap where defects slip through.
Manual deployments also introduce their own defects. A step skipped in the checklist, a wrong
version of a config file, a service restarted in the wrong order - these are deployment bugs that
have nothing to do with the code. They are caused by the deployment process itself. The more manual
steps involved, the more opportunities for human error.
A pipeline eliminates both categories of risk. Every commit passes through the same automated
checks. The artifact that is tested is the artifact that is deployed. There are no skipped steps
because the steps are encoded in the pipeline definition and execute the same way every time.
It increases rework
Manual deployments are slow, so teams batch changes to reduce deployment frequency. Batching means
more changes per deployment. More changes means harder debugging when something goes wrong, because
any of dozens of commits could be the cause. The team spends hours bisecting changes to find the
one that broke production.
Failed manual deployments create their own rework. A deployment that goes wrong must be diagnosed,
rolled back (if rollback is even possible), and re-attempted. Each re-attempt burns time and
attention. If the deployment corrupted data or left the system in a partial state, the recovery
effort dwarfs the original deployment.
Rework also accumulates in the deployment procedure itself. Every deployment surfaces a new edge
case or a new prerequisite that was not in the checklist. Someone updates the wiki. The next
deployer reads the old version. The procedure is never quite right because manual procedures
cannot be versioned, tested, or reviewed the way code can.
With an automated pipeline, deployments are fast and repeatable. Small changes deploy individually.
Failed deployments are rolled back automatically. The pipeline definition is code - versioned,
reviewed, and tested like any other part of the system.
It makes delivery timelines unpredictable
A manual deployment takes an unpredictable amount of time. The optimistic case is 30 minutes. The
realistic case includes troubleshooting unexpected errors, waiting for the right person to be
available, and re-running steps that failed. A “quick deploy” can easily consume half a day.
The team cannot commit to release dates because the deployment itself is a variable. “We can deploy
on Tuesday” becomes “we can start the deployment on Tuesday, and we’ll know by Wednesday whether it
worked.” Stakeholders learn that deployment dates are approximate, not firm.
The unpredictability also limits deployment frequency. If each deployment takes hours of manual
effort and carries risk of failure, the team deploys as infrequently as possible. This increases
batch size, which increases risk, which makes deployments even more painful, which further
discourages frequent deployment. The team is trapped in a cycle where the lack of a pipeline makes
deployments costly, and costly deployments make the lack of a pipeline seem acceptable.
An automated pipeline makes deployment duration fixed and predictable. A deploy takes the same
amount of time whether it happens once a month or ten times a day. The cost per deployment drops
to near zero, removing the incentive to batch.
It concentrates knowledge in too few people
When deployment is manual, the knowledge of how to deploy lives in people rather than in code. The
team depends on specific individuals who know the servers, the credentials, the order of
operations, and the workarounds for known issues. These individuals become bottlenecks and single
points of failure.
When the deployment expert is unavailable - sick, on vacation, or has left the company - the team
is stuck. Someone else must reconstruct the deployment procedure from incomplete documentation and
trial and error. Deployments attempted by inexperienced team members fail at higher rates, which
reinforces the belief that only experts should deploy.
A pipeline encodes deployment knowledge in an executable definition that anyone can run. New team
members deploy on their first day by triggering the pipeline. The deployment expert’s knowledge is
preserved in code rather than in their head. The bus factor for deployments moves from one to the
entire team.
Impact on continuous delivery
Continuous delivery requires an automated, repeatable pipeline that can take any commit from trunk
and deliver it to production with confidence. Without a pipeline, none of this is possible. There
is no automation to repeat. There is no confidence that the process will work the same way twice.
There is no path from commit to production that does not require a human to drive it.
The pipeline is not an optimization of manual deployment. It is a prerequisite for CD. A team
without a pipeline cannot practice CD any more than a team without source control can practice
version management. The pipeline is the foundation. Everything else - automated testing, deployment
strategies, progressive rollouts, fast rollback - depends on it existing.
How to Fix It
Step 1: Document the current manual process exactly (Week 1)
Before automating, capture what the team actually does today. Have the person who deploys most
often write down every step in order:
- What commands do they run?
- What servers do they connect to?
- What credentials do they use?
- What checks do they perform before, during, and after?
- What do they do when something goes wrong?
This document is not the solution - it is the specification for the first version of the pipeline.
Every manual step will become an automated step.
Step 2: Automate the build (Week 2)
Start with the simplest piece: turning source code into a deployable artifact without manual
intervention.
- Choose a CI server (Jenkins, GitHub Actions, GitLab CI, CircleCI, or any tool that triggers on
commit).
- Configure it to check out the code and run the build command on every push to trunk.
- Store the build output as a versioned artifact.
At this point, the team has an automated build but still deploys manually. That is fine. The
pipeline will grow incrementally.
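The CI configuration itself is tool-specific, but the step it triggers can be a small,
version-controlled script. A rough sketch - the build command and artifact paths are placeholders
for whatever your project actually uses:

```python
# build.py - run by the CI server on every push to trunk.
# Produces an artifact named after the commit it was built from.
import pathlib
import shutil
import subprocess

def main():
    sha = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    # Placeholder build command - substitute your real one.
    subprocess.run(["make", "build"], check=True)

    artifacts = pathlib.Path("artifacts")
    artifacts.mkdir(exist_ok=True)
    # Placeholder output path - whatever your build produces.
    shutil.copy("dist/app.tar.gz", artifacts / f"app-{sha}.tar.gz")
    print(f"Built artifact app-{sha}.tar.gz")

if __name__ == "__main__":
    main()
```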
Step 3: Add automated tests to the build (Week 3)
If the team has any automated tests, add them to the pipeline so they run after the build
succeeds. If the team has no automated tests, add one. A single test that verifies the application
starts up is more valuable than zero tests.
The pipeline should now fail if the build fails or if any test fails. This is the first automated
quality gate. No artifact is produced unless the code compiles and the tests pass.
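That first test can be as small as "the application starts and answers a health check." A sketch -
the start command, port, and endpoint are placeholders:

```python
# test_startup.py - the first automated quality gate: does the app even start?
import subprocess
import time
import urllib.request

def test_application_starts_and_responds():
    proc = subprocess.Popen(["python", "app.py"])  # placeholder start command
    try:
        deadline = time.monotonic() + 30
        while time.monotonic() < deadline:
            try:
                with urllib.request.urlopen("http://localhost:8000/health") as resp:
                    assert resp.status == 200
                    return
            except OSError:
                time.sleep(0.5)
        raise AssertionError("application did not become healthy within 30 seconds")
    finally:
        proc.terminate()
```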
Step 4: Automate the deployment to a non-production environment (Weeks 3-4)
Take the manual deployment steps from Step 1 and encode them in a script or pipeline stage that
deploys the tested artifact to a staging or test environment:
- Provision or configure the target environment.
- Deploy the artifact.
- Run a smoke test to verify the deployment succeeded.
The team now has a pipeline that builds, tests, and deploys to a non-production environment on
every commit. Deployments to this environment should happen without any human intervention.
Step 5: Extend the pipeline to production (Weeks 5-6)
Once the team trusts the automated deployment to non-production environments, extend it to
production:
- Add a manual approval gate if the team is not yet comfortable with fully automated production
deployments. This is a temporary step - the goal is to remove it later.
- Use the same deployment script and process for production that you use for non-production. The
only difference should be the target environment and its configuration.
- Add post-deployment verification: health checks, smoke tests, or basic monitoring checks that
confirm the deployment is healthy.
The first automated production deployment will be nerve-wracking. That is normal. Run it alongside
the manual process the first few times: deploy automatically, then verify manually. As confidence
grows, drop the manual verification.
Step 6: Address the objections (Ongoing)
| Objection | Response |
| --- | --- |
| “Our deployments are too complex to automate” | If a human can follow the steps, a script can execute them. Complex deployments benefit the most from automation because they have the most opportunities for human error. |
| “We don’t have time to build a pipeline” | You are already spending time on every manual deployment. A pipeline is an investment that pays back on the second deployment and every deployment after. |
| “Only Dave knows how to deploy” | That is the problem, not a reason to keep the status quo. Building the pipeline captures Dave’s knowledge in code. Dave should lead the pipeline effort because he knows the procedure best. |
| “What if the pipeline deploys something broken?” | The pipeline includes automated tests and can include approval gates. A broken deployment from a pipeline is no worse than a broken deployment from a human - and the pipeline can roll back automatically. |
| “Our infrastructure doesn’t support modern CI/CD tools” | Start with a shell script triggered by a cron job or a webhook. A pipeline does not require Kubernetes or cloud-native infrastructure. It requires automation of the steps you already perform manually. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Manual steps in the deployment process | Should decrease to zero |
| Deployment duration | Should decrease and stabilize as manual steps are automated |
| Release frequency | Should increase as deployment cost drops |
| Deployment failure rate | Should decrease as human error is removed |
| People who can deploy to production | Should increase from one or two to the entire team |
| Lead time | Should decrease as the manual deployment bottleneck is eliminated |
Related Content
1.4.2 - Manual Deployments
The build is automated but deployment is not. Someone must SSH into servers, run scripts, and shepherd each release to production by hand.
Category: Pipeline & Infrastructure | Quality Impact: High
What This Looks Like
The team has a CI server. Code is built and tested automatically on every push. The pipeline
dashboard is green. But between “pipeline passed” and “code running in production,” there is a
person. Someone must log into a deployment tool, click a button, select the right artifact, choose
the right environment, and watch the output scroll by. Or they SSH into servers, pull the artifact,
run migration scripts, restart services, and verify health checks - all by hand.
The team may not even think of this as a problem. The build is automated. The tests run
automatically. Deployment is “just the last step.” But that last step takes 30 minutes to an hour
of focused human attention, can only happen when the right person is available, and fails often
enough that nobody wants to do it on a Friday afternoon.
Deployment has its own rituals. The team announces in Slack that a deploy is starting. Other
developers stop merging. Someone watches the logs. Another person checks the monitoring dashboard.
When it is done, someone posts a confirmation. The whole team holds its breath during the process
and exhales when it works. This ceremony happens every time, whether the release is one commit or
fifty.
Common variations:
- The button-click deploy. The CI/CD tool has a “deploy to production” button, but a human must
click it and then monitor the result. The automation exists but is not trusted to run
unattended. Someone watches every deployment from start to finish.
- The runbook deploy. A document describes the deployment steps in order. The deployer follows
the runbook, executing commands manually at each step. The runbook was written months ago and
has handwritten corrections in the margins. Some steps have been added, others crossed out.
- The SSH-and-pray deploy. The deployer SSHs into each server individually, pulls code or
copies artifacts, runs scripts, and restarts services. The order matters. Missing a server means
a partial deployment. The deployer keeps a mental checklist of which servers are done.
- The release coordinator deploy. One person coordinates the deployment across multiple systems.
They send messages to different teams: “deploy service A now,” “run the database migration,”
“restart the cache.” The deployment is a choreographed multi-person event.
- The after-hours deploy. Deployments happen only outside business hours because the manual
process is risky enough that the team wants minimal user traffic. Deployers work evenings or
weekends. The deployment window is sacred and stressful.
The telltale sign: if the pipeline is green but the team still needs to “do a deploy” as a
separate activity, deployment is manual.
Why This Is a Problem
A manual deployment negates much of the value that an automated build and test pipeline provides.
The pipeline can validate code in minutes, but if the last mile to production requires a human,
the delivery speed is limited by that human’s availability, attention, and reliability.
It reduces quality
Manual deployment introduces a category of defects that have nothing to do with the code. A
deployer who runs migration scripts in the wrong order corrupts data. A deployer who forgets to
update a config file on one of four servers creates inconsistent behavior. A deployer who restarts
services too quickly triggers a cascade of connection errors. These are process defects - bugs
introduced by the deployment method, not the software.
Manual deployments also degrade the quality signal from the pipeline. The pipeline tests a specific
artifact in a specific configuration. If the deployer manually adjusts configuration, selects a
different artifact version, or skips a verification step, the deployed system no longer matches
what the pipeline validated. The pipeline said “this is safe to deploy,” but what actually reached
production is something slightly different.
Automated deployment eliminates process defects by executing the same steps in the same order
every time. The artifact the pipeline tested is the artifact that reaches production. Configuration
is applied from version-controlled definitions, not from human memory. The deployment is identical
whether it happens at 2 PM on Tuesday or 3 AM on Saturday.
It increases rework
Because manual deployments are slow and risky, teams batch changes. Instead of deploying each
commit individually, they accumulate a week or two of changes and deploy them together. When
something breaks in production, the team must determine which of thirty commits caused the problem.
This diagnosis takes hours. The fix takes more hours. If the fix itself requires a deployment, the
team must go through the manual process again.
Failed deployments are especially costly. A manual deployment that leaves the system in a broken
state requires manual recovery. The deployer must diagnose what went wrong, decide whether to roll
forward or roll back, and execute the recovery steps by hand. If the deployment was a multi-server
process and some servers are on the new version while others are on the old version, the recovery
is even harder. The team may spend more time recovering from a failed deployment than they spent
on the deployment itself.
With automated deployments, each commit deploys individually. When something breaks, the cause is
obvious - it is the one commit that just deployed. Rollback is a single action, not a manual
recovery effort. The time from “something is wrong” to “the previous version is running” is
minutes, not hours.
It makes delivery timelines unpredictable
The gap between “pipeline is green” and “code is in production” is measured in human availability.
If the deployer is in a meeting, the deployment waits. If the deployer is on vacation, the
deployment waits longer. If the deployment fails and the deployer needs help, the recovery depends
on who else is around.
This human dependency makes release timing unpredictable. The team cannot promise “this fix will be
in production in 30 minutes” because the deployment requires a person who may not be available for
hours. Urgent fixes wait for deployment windows. Critical patches wait for the release coordinator
to finish lunch.
The batching effect adds another layer of unpredictability. When teams batch changes to reduce
deployment frequency, each deployment becomes larger and riskier. Larger deployments take longer to
verify and are more likely to fail. The team cannot predict how long the deployment will take
because they cannot predict what will go wrong with a batch of thirty changes.
Automated deployment makes the time from “pipeline green” to “running in production” fixed and
predictable. It takes the same number of minutes regardless of who is available, what day it is,
or how many other things are happening. The team can promise delivery timelines because the
deployment is a deterministic process, not a human activity.
It prevents fast recovery
When production breaks, speed of recovery determines the blast radius. A team that can deploy a
fix in five minutes limits the damage. A team that needs 45 minutes of manual deployment work
exposes users to the problem for 45 minutes plus diagnosis time.
Manual rollback is even worse. Many teams with manual deployments have no practiced rollback
procedure at all. “Rollback” means “re-deploy the previous version,” which means running the
entire manual deployment process again with a different artifact. If the deployment process takes
an hour, rollback takes an hour. If the deployment process requires a specific person, rollback
requires that same person.
Some manual deployments cannot be cleanly rolled back. Database migrations that ran during the
deployment may not have reverse scripts. Config changes applied to servers may not have been
tracked. The team is left doing a forward fix under pressure, manually deploying a patch through
the same slow process that caused the problem.
Automated pipelines with automated rollback can revert to the previous version in minutes. The
rollback follows the same tested path as the deployment. No human judgment is required. The team’s
mean time to repair drops from hours to minutes.
Impact on continuous delivery
Continuous delivery means any commit that passes the pipeline can be released to production at any
time with confidence. Manual deployment breaks this definition at “at any time.” The commit can
only be released when a human is available to perform the deployment, when the deployment window
is open, and when the team is ready to dedicate attention to watching the process.
The manual deployment step is the bottleneck that limits everything upstream. The pipeline can
validate commits in 10 minutes, but if deployment takes an hour of human effort, the team will
never deploy more than a few times per day at best. In practice, teams with manual deployments
release weekly or biweekly because the deployment overhead makes anything more frequent
impractical.
The pipeline is only half the delivery system. Automating the build and tests without automating
the deployment is like paving a highway that ends in a dirt road. The speed of the paved section
is irrelevant if every journey ends with a slow, bumpy last mile.
How to Fix It
Step 1: Script the current manual process (Week 1)
Take the runbook, the checklist, or the knowledge in the deployer’s head and turn it into a
script. Do not redesign the process yet - just encode what the team already does.
- Record a deployment from start to finish. Note every command, every server, every check.
- Write a script that executes those steps in order.
- Store the script in version control alongside the application code.
The script will be rough. It will have hardcoded values and assumptions. That is fine. The goal
is to make the deployment reproducible by any team member, not to make it perfect.
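The first version can be a direct transcription of the runbook. A rough sketch - hostnames, paths,
and service names are placeholders for whatever your checklist actually says:

```python
# deploy.py - the runbook, encoded. Every value below is a placeholder.
import subprocess
import sys

HOSTS = ["app1.example.internal", "app2.example.internal"]
ARTIFACT = sys.argv[1]               # e.g. artifacts/app-a1b2c3.tar.gz
REMOTE_PATH = "/opt/app/releases/"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)  # stop on the first failed step

for host in HOSTS:
    run(["scp", ARTIFACT, f"{host}:{REMOTE_PATH}"])
    run(["ssh", host, "sudo systemctl restart app.service"])
    run(["ssh", host, "curl --fail --silent http://localhost:8000/health"])
print("Deployment complete.")
```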
Step 2: Run the script from the pipeline (Week 2)
Connect the deployment script to the CI/CD pipeline so it runs automatically after the build and
tests pass. Start with a non-production environment:
- Add a deployment stage to the pipeline that targets a staging or test environment.
- Trigger it automatically on every successful build.
- Add a smoke test after deployment to verify it worked.
The team now gets automatic deployments to a non-production environment on every commit. This
builds confidence in the automation and surfaces problems early.
Step 3: Externalize configuration and secrets (Weeks 2-3)
Manual deployments often involve editing config files on servers or passing environment-specific
values by hand. Move these out of the manual process:
- Store environment-specific configuration in a config management system or environment variables
managed by the pipeline.
- Move secrets to a secrets manager (Vault, AWS Secrets Manager, Azure Key Vault, or even
encrypted pipeline variables as a starting point).
- Ensure the deployment script reads configuration from these sources rather than from hardcoded
values or manual input.
This step is critical because manual configuration is one of the most common sources of deployment
failures. Automating deployment without automating configuration just moves the manual step.
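On the application side the change is usually small: read environment-specific values from
variables the pipeline injects rather than from files edited by hand. A sketch with hypothetical
variable names:

```python
# config.py - environment-specific values come from the pipeline, not hand edits.
import os

def require(name: str) -> str:
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value

DATABASE_URL = require("DATABASE_URL")        # injected by the pipeline
PAYMENT_API_KEY = require("PAYMENT_API_KEY")  # injected from the secrets manager
LOG_LEVEL = os.environ.get("LOG_LEVEL", "info")
```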
Step 4: Automate production deployment with a gate (Weeks 3-4)
Extend the pipeline to deploy to production using the same script and process:
- Add a production deployment stage after the non-production deployment succeeds.
- Include a manual approval gate - a button that a team member clicks to authorize the production
deployment. This is a temporary safety net while the team builds confidence.
- Add post-deployment health checks that automatically verify the deployment succeeded.
- Add automated rollback that triggers if the health checks fail.
The approval gate means a human still decides when to deploy, but the deployment itself is fully
automated. No SSHing. No manual steps. No watching logs scroll by.
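The health check and rollback can start as one more pipeline step. A sketch - the health URL and
the rollback command are placeholders for your own:

```python
# verify_release.py - run by the pipeline after the production deploy.
# If the health check does not go green in time, roll back automatically.
import subprocess
import time
import urllib.request

HEALTH_URL = "https://app.example.com/health"                        # placeholder
ROLLBACK_CMD = ["python", "deploy.py", "artifacts/previous.tar.gz"]  # placeholder

def healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

deadline = time.monotonic() + 120
while time.monotonic() < deadline:
    if healthy():
        print("Deployment verified healthy.")
        raise SystemExit(0)
    time.sleep(5)

print("Health check failed - rolling back to the previous artifact.")
subprocess.run(ROLLBACK_CMD, check=True)
raise SystemExit(1)
```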
Step 5: Remove the manual gate (Weeks 6-8)
Once the team has seen the automated production deployment succeed repeatedly, remove the manual
approval gate. The pipeline now deploys to production automatically when all checks pass.
This is the hardest step emotionally. The team will resist. Expect these objections:
| Objection | Response |
| --- | --- |
| “We need a human to decide when to deploy” | Why? If the pipeline validates the code and the deployment process is automated and tested, what decision is the human making? If the answer is “checking that nothing looks weird,” that check should be automated. |
| “What if it deploys during peak traffic?” | Use deployment windows in the pipeline configuration, or use progressive rollout strategies that limit blast radius regardless of traffic. |
| “We had a bad deployment last month” | Was it caused by the automation or by a gap in testing? If the tests missed a defect, the fix is better tests, not a manual gate. If the deployment process itself failed, the fix is better deployment automation, not a human watching. |
| “Compliance requires manual approval” | Review the actual compliance requirement. Most require evidence of approval, not a human clicking a button at deployment time. A code review approval, an automated policy check, or an audit log of the pipeline run often satisfies the requirement. |
| “Our deployments require coordination with other teams” | Automate the coordination. Use API contracts, deployment dependencies in the pipeline, or event-based triggers. If another team must deploy first, encode that dependency rather than coordinating in Slack. |
Step 6: Add deployment observability (Ongoing)
Once deployments are automated, invest in knowing whether they worked:
- Monitor error rates, latency, and key business metrics after every deployment.
- Set up automatic rollback triggers tied to these metrics.
- Track deployment frequency, duration, and failure rate over time.
The team should be able to deploy without watching. The monitoring watches for them.
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Manual steps per deployment | Should reach zero |
| Deployment duration (human time) | Should drop from hours to zero - the pipeline does the work |
| Release frequency | Should increase as deployment friction drops |
| Change fail rate | Should decrease as manual process defects are eliminated |
| Mean time to repair | Should decrease as rollback becomes automated |
| Lead time | Should decrease as the deployment bottleneck is removed |
Related Content
1.4.3 - Snowflake Environments
Each environment is hand-configured and unique. Nobody knows exactly what is running where. Configuration drift is constant.
Category: Pipeline & Infrastructure | Quality Impact: High
What This Looks Like
Staging has a different version of the database than production. The dev environment has a library
installed that nobody remembers adding. Production has a configuration file that was edited by hand
six months ago during an incident and never committed to source control. Nobody is sure all three
environments are running the same OS patch level.
A developer asks “why does this work in staging but not in production?” The answer takes hours to
find because it requires comparing configurations across environments by hand - diffing config
files, checking installed packages, verifying environment variables one by one.
Common variations:
- The hand-built server. Someone provisioned the production server two years ago. They followed
a wiki page that has since been edited, moved, or deleted. Nobody has provisioned a new one
since. If the server dies, nobody is confident they can recreate it.
- The magic SSH session. During an incident, someone SSH-ed into production and changed a
config value. It fixed the problem. Nobody updated the deployment scripts, the infrastructure
code, or the documentation. The next deployment overwrites the fix - or doesn’t, depending on
which files the deployment touches.
- The shared dev environment. A single development or staging environment is shared by the
whole team. One developer installs a library, another changes a config value, a third adds a
cron job. The environment drifts from any known baseline within weeks.
- The “production is special” mindset. Dev and staging environments are provisioned with
scripts, but production was set up differently because of “security requirements” or “scale
differences.” The result is that the environments the team tests against are structurally
different from the one that serves users.
- The environment with a name. Environments have names like “staging-v2” or “qa-new” because
someone created a new one alongside the old one. Both still exist. Nobody is sure which one the
pipeline deploys to.
The telltale sign: deploying the same artifact to two environments produces different results,
and the team’s first instinct is to check environment configuration rather than application code.
Why This Is a Problem
Snowflake environments undermine the fundamental premise of testing: that the behavior you observe
in one environment predicts the behavior you will see in another. When every environment is
unique, testing in staging tells you what works in staging - nothing more.
It reduces quality
When environments differ, bugs hide in the gaps. An application that works in staging may fail in
production because of a different library version, a missing environment variable, or a filesystem
permission that was set by hand. These bugs are invisible to testing because the test environment
does not reproduce the conditions that trigger them.
The team learns this the hard way, one production incident at a time. Each incident teaches the
team that “passed in staging” does not mean “will work in production.” This erodes trust in the
entire testing and deployment process. Developers start adding manual verification steps -
checking production configs by hand before deploying, running smoke tests manually after
deployment, asking the ops team to “keep an eye on things.”
When environments are identical and provisioned from the same code, the gap between staging and
production disappears. What works in staging works in production because the environments are the
same. Testing produces reliable results.
It increases rework
Snowflake environments cause two categories of rework. First, developers spend hours debugging
environment-specific issues that have nothing to do with application code. “Why does this work on
my machine but not in CI?” leads to comparing configurations, googling error messages related to
version mismatches, and patching environments by hand. This time is pure waste.
Second, production incidents caused by environment drift require investigation, rollback, and
fixes to both the application and the environment. A configuration difference that causes a
production failure might take five minutes to fix once identified, but identifying it takes hours
because nobody knows what the correct configuration should be.
Teams with reproducible environments spend zero time on environment debugging. If an environment
is wrong, they destroy it and recreate it from code. The investigation time drops from hours to
minutes.
It makes delivery timelines unpredictable
Deploying to a snowflake environment is unpredictable because the environment itself is an
unknown variable. The same deployment might succeed on Monday and fail on Friday because someone
changed something in the environment between the two deploys. The team cannot predict how long a
deployment will take because they cannot predict what environment issues they will encounter.
This unpredictability compounds across environments. A change must pass through dev, staging, and
production, and each environment is a unique snowflake with its own potential for surprise. A
deployment that should take minutes takes hours because each environment reveals a new
configuration issue.
Reproducible environments make deployment time a constant. The same artifact deployed to the same
environment specification produces the same result every time. Deployment becomes a predictable
step in the pipeline rather than an adventure.
It makes environments a scarce resource
When environments are hand-configured, creating a new one is expensive. It takes hours or days of
manual work. The team has a small number of shared environments and must coordinate access. “Can
I use staging today?” becomes a daily question. Teams queue up for access to the one environment
that resembles production.
This scarcity blocks parallel work. Two developers who both need to test a database migration
cannot do so simultaneously if there is only one staging environment. One waits while the other
finishes. Features that could be validated in parallel are serialized through a shared
environment bottleneck.
When environments are defined as code, spinning up a new one is a pipeline step that takes
minutes. Each developer or feature branch can have its own environment. There is no contention
because environments are disposable and cheap.
Impact on continuous delivery
Continuous delivery requires that any change can move from commit to production through a fully
automated pipeline. Snowflake environments break this in multiple ways. The pipeline cannot
provision environments automatically if environments are hand-configured. Testing results are
unreliable because environments differ. Deployments fail unpredictably because of configuration
drift.
A team with snowflake environments cannot trust their pipeline. They cannot deploy frequently
because each deployment risks hitting an environment-specific issue. They cannot automate
fully because the environments require manual intervention. The path from commit to production
is neither continuous nor reliable.
How to Fix It
Step 1: Document what exists today (Week 1)
Before automating anything, capture the current state of each environment:
- For each environment (dev, staging, production), record: OS version, installed packages,
configuration files, environment variables, external service connections, and any manual
customizations.
- Diff the environments against each other. Note every difference.
- Classify each difference as intentional (e.g., production uses a larger instance size) or
accidental (e.g., staging has an old library version nobody updated).
This audit surfaces the drift. Most teams are surprised by how many accidental differences exist.
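The diff step is easy to automate once each environment's state is captured as text. A sketch
that compares two captured package lists - the capture format is an assumption:

```python
# diff_envs.py - compare captured package lists from two environments.
# Usage: python diff_envs.py staging-packages.txt production-packages.txt
import sys

def load(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

staging, production = load(sys.argv[1]), load(sys.argv[2])

for pkg in sorted(staging - production):
    print(f"only in staging:    {pkg}")
for pkg in sorted(production - staging):
    print(f"only in production: {pkg}")
```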
Step 2: Define one environment specification (Weeks 2-3)
Choose an infrastructure-as-code tool (Terraform, Pulumi, CloudFormation, Ansible, or similar)
and write a specification for one environment. Start with the environment you understand best -
usually staging.
The specification should define:
- Base infrastructure (servers, containers, networking)
- Installed packages and their versions
- Configuration files and their contents
- Environment variables with placeholder values
- Any scripts that run at provisioning time
Verify the specification by destroying the staging environment and recreating it from code. If
the recreated environment works, the specification is correct. If it does not, fix the
specification until it does.
Step 3: Parameterize for environment differences (Week 3)
Intentional differences between environments (instance sizes, database connection strings, API
keys) become parameters, not separate specifications. One specification with environment-specific
variables:
| Parameter | Dev | Staging | Production |
| --- | --- | --- | --- |
| Instance size | small | medium | large |
| Database host | dev-db.internal | staging-db.internal | prod-db.internal |
| Log level | debug | info | warn |
| Replica count | 1 | 2 | 3 |
The structure is identical. Only the values change. This eliminates accidental drift because every
environment is built from the same template.
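Whatever tool you choose, the shape is the same: one template, one small set of values per
environment. A tool-agnostic sketch of the idea (the parameter names mirror the table above; in a
real setup the rendering belongs to your infrastructure-as-code tool):

```python
# One template, three parameter sets - accidental drift has nowhere to hide.
TEMPLATE_KEYS = {"instance_size", "database_host", "log_level", "replica_count"}

PARAMETERS = {
    "dev":        {"instance_size": "small",  "database_host": "dev-db.internal",     "log_level": "debug", "replica_count": 1},
    "staging":    {"instance_size": "medium", "database_host": "staging-db.internal", "log_level": "info",  "replica_count": 2},
    "production": {"instance_size": "large",  "database_host": "prod-db.internal",    "log_level": "warn",  "replica_count": 3},
}

def render(environment: str) -> dict:
    values = PARAMETERS[environment]
    missing = TEMPLATE_KEYS - values.keys()
    if missing:
        raise ValueError(f"{environment} is missing parameters: {sorted(missing)}")
    return values

print(render("staging"))
```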
Step 4: Provision environments through the pipeline (Week 4)
Add environment provisioning to the deployment pipeline:
- Before deploying to an environment, the pipeline provisions (or updates) it from the
infrastructure code.
- The application artifact is deployed to the freshly provisioned environment.
- If provisioning or deployment fails, the pipeline fails - no manual intervention.
This closes the loop. Environments cannot drift because they are recreated or reconciled on
every deployment. Manual SSH sessions and hand edits have no lasting effect because the next
pipeline run overwrites them.
Step 5: Make environments disposable (Week 5+)
The ultimate goal is that any environment can be destroyed and recreated in minutes with no data
loss and no human intervention:
- Practice destroying and recreating staging weekly. This verifies the specification stays
accurate and builds team confidence.
- Provision ephemeral environments for feature branches or pull requests. Let the pipeline
create and destroy them automatically.
- If recreating production is not feasible yet (stateful systems, licensing), ensure you can
provision a production-identical environment for testing at any time.
| Objection | Response |
| --- | --- |
| “Production has unique requirements we can’t codify” | If a requirement exists only in production and is not captured in code, it is at risk of being lost. Codify it. If it is truly unique, it belongs in a parameter, not a hand-edit. |
| “We don’t have time to learn infrastructure-as-code” | You are already spending that time debugging environment drift. The investment pays for itself within weeks. Start with the simplest tool that works for your platform. |
| “Our environments are managed by another team” | Work with them. Provide the specification. If they provision from your code, you both benefit: they have a reproducible process and you have predictable environments. |
| “Containers solve this problem” | Containers solve application-level consistency. You still need infrastructure-as-code for the platform the containers run on - networking, storage, secrets, load balancers. Containers are part of the solution, not the whole solution. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Environment provisioning time | Should decrease from hours/days to minutes |
| Configuration differences between environments | Should reach zero accidental differences |
| “Works in staging but not production” incidents | Should drop to near zero |
| Change fail rate | Should decrease as environment parity improves |
| Mean time to repair | Should decrease as environments become reproducible |
| Time spent debugging environment issues | Track informally - should approach zero |
Related Content
1.5 - Organizational and Cultural
Anti-patterns in team culture, management practices, and organizational structure that block continuous delivery.
These anti-patterns affect the human and organizational side of delivery. They create
misaligned incentives, erode trust, and block the cultural changes that continuous delivery
requires. Technical practices alone cannot overcome a culture that works against them.
1.5.1 - Change Advisory Board Gates
Manual committee approval required for every production change. Meetings are weekly. One-line fixes wait alongside major migrations.
Category: Organizational & Cultural | Quality Impact: High
What This Looks Like
Before any change can reach production, it must be submitted to the Change Advisory Board. The
developer fills out a change request form: description of the change, impact assessment, rollback
plan, testing evidence, and approval signatures. The form goes into a queue. The CAB meets once
a week - sometimes every two weeks - to review the queue. Each change gets a few minutes of
discussion. The board approves, rejects, or requests more information.
A one-line configuration fix that a developer finished on Monday waits until Thursday’s CAB
meeting. If the board asks a question, the change waits until the next meeting. A two-line bug
fix sits in the same queue as a database migration, reviewed by the same people with the same
ceremony.
Common variations:
- The rubber-stamp CAB. The board approves everything. Nobody reads the change requests
carefully because the volume is too high and the context is too shallow. The meeting exists
to satisfy an audit requirement, not to catch problems. It adds delay without adding safety.
- The bottleneck approver. One person on the CAB must approve every change. That person is
in six other meetings, has 40 pending reviews, and is on vacation next week. Deployments
stop when they are unavailable.
- The emergency change process. Urgent fixes bypass the CAB through an “emergency change”
procedure that requires director-level approval and a post-hoc review. The emergency process
is faster, so teams learn to label everything urgent. The CAB process is for scheduled changes,
and fewer changes are scheduled.
- The change freeze. Certain periods - end of quarter, major events, holidays - are declared
change-free zones. No production changes for days or weeks. Changes pile up during the freeze
and deploy in a large batch afterward, which is exactly the high-risk event the freeze was
meant to prevent.
- The form-driven process. The change request template has 15 fields, most of which are
irrelevant for small changes. Developers spend more time filling out the form than making the
change. Some fields require information the developer does not have, so they make something up.
The telltale sign: a developer finishes a change and says “now I need to submit it to the CAB”
with the same tone they would use for “now I need to go to the dentist.”
Why This Is a Problem
CAB gates exist to reduce risk. In practice, they increase risk by creating delay, encouraging
batching, and providing a false sense of security. The review is too shallow to catch real
problems and too slow to enable fast delivery.
It reduces quality
A CAB review is a review by people who did not write the code, did not test it, and often do not
understand the system it affects. A board member scanning a change request form for five minutes
cannot assess the quality of a code change. They can check that the form is filled out. They
cannot check that the change is safe.
The real quality checks - automated tests, code review by peers, deployment verification - happen
before the CAB sees the change. The CAB adds nothing to quality because it reviews paperwork, not
code. The developer who wrote the tests and the reviewer who read the diff know far more about
the change’s risk than a board member reading a summary.
Meanwhile, the delay the CAB introduces actively harms quality. A bug fix that is ready on Monday
but cannot deploy until Thursday means users experience the bug for three extra days. A security
patch that waits for weekly approval is a vulnerability window measured in days.
Teams without CAB gates deploy quality checks into the pipeline itself: automated tests, security
scans, peer review, and deployment verification. These checks are faster, more thorough, and
more reliable than a weekly committee meeting.
It increases rework
The CAB process generates significant administrative overhead. For every change, a developer must
write a change request, gather approval signatures, and attend (or wait for) the board meeting.
This overhead is the same whether the change is a one-line typo fix or a major feature.
When the CAB requests more information or rejects a change, the cycle restarts. The developer
updates the form, resubmits, and waits for the next meeting. A change that was ready to deploy
a week ago sits in a review loop while the developer has moved on to other work. Picking it back
up costs context-switching time.
The batching effect creates its own rework. When changes are delayed by the CAB process, they
accumulate. Developers merge multiple changes to avoid submitting multiple requests. Larger
batches are harder to review, harder to test, and more likely to cause problems. When a problem
occurs, it is harder to identify which change in the batch caused it.
It makes delivery timelines unpredictable
The CAB introduces a built-in delay into every deployment. If the board meets weekly, the wait
from “change ready” to “change deployed” can stretch to a full week, depending on when the change
was finished relative to the meeting schedule. This delay is independent of the change’s size,
risk, or urgency.
The delay is also variable. A change submitted on Monday might be approved Thursday. A change
submitted on Friday waits until the following Thursday. If the board requests revisions, add
another week. Developers cannot predict when their change will reach production because the
timeline depends on a meeting schedule and a queue they do not control.
This unpredictability makes it impossible to make reliable commitments. When a stakeholder asks
“when will this be live?” the developer must account for development time plus an unpredictable
CAB delay. The answer becomes “sometime in the next one to three weeks” for a change that took
two hours to build.
It creates a false sense of security
The most dangerous effect of the CAB is the belief that it prevents incidents. It does not. The
board reviews paperwork, not running systems. A well-written change request for a dangerous
change will be approved. A poorly written request for a safe change will be questioned. The
correlation between CAB approval and deployment safety is weak at best.
Studies of high-performing delivery organizations consistently show that external change approval
processes do not reduce failure rates. The 2019 Accelerate State of DevOps Report found that
teams with external change approval had higher failure rates than teams using peer review and
automated checks. The CAB provides a feeling of control without the substance.
This false sense of security is harmful because it displaces investment in controls that
actually work. If the organization believes the CAB prevents incidents, there is less pressure
to invest in automated testing, deployment verification, and progressive rollout - the controls
that actually reduce deployment risk.
Impact on continuous delivery
Continuous delivery requires that any change can reach production quickly through an automated
pipeline. A weekly approval meeting is fundamentally incompatible with continuous deployment.
The math is simple. If the CAB meets weekly and reviews 20 changes per meeting, the maximum
deployment frequency is 20 per week. A team practicing CD might deploy 20 times per day - a
hundred or more deployments per week. The CAB process caps throughput at a small fraction of
what the team could otherwise achieve.
More importantly, the CAB process assumes that human review of change requests is a meaningful
quality gate. CD assumes that automated checks - tests, security scans, deployment verification -
are better quality gates because they are faster, more consistent, and more thorough. These are
incompatible philosophies. A team practicing CD replaces the CAB with pipeline-embedded controls
that provide equivalent (or superior) risk management without the delay.
How to Fix It
Eliminating the CAB outright is rarely possible because it exists to satisfy regulatory or
organizational governance requirements. The path forward is to replace the manual ceremony with
automated controls that satisfy the same requirements faster and more reliably.
Step 1: Classify changes by risk (Week 1)
Not all changes carry the same risk. Introduce a risk classification:
| Risk level | Criteria | Example | Approval process |
| --- | --- | --- | --- |
| Standard | Small, well-tested, automated rollback | Config change, minor bug fix, dependency update | Peer review + passing pipeline = auto-approved |
| Normal | Medium scope, well-tested | New feature behind a feature flag, API endpoint addition | Peer review + passing pipeline + team lead sign-off |
| High | Large scope, architectural, or compliance-sensitive | Database migration, authentication change, PCI-scoped change | Peer review + passing pipeline + architecture review |
The goal is to route 80-90% of changes through the standard process, which requires no CAB
involvement at all.
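Classification only works if it is mechanical enough that nobody debates it per change. One way to keep it mechanical is to derive the level from facts the pipeline already knows about the change. A hedged sketch - the fields and thresholds are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass

@dataclass
class Change:
    lines_changed: int
    touches_schema: bool
    touches_auth: bool
    has_automated_rollback: bool

def classify(change: Change) -> str:
    """Map a change to standard / normal / high risk, mirroring the table above."""
    if change.touches_schema or change.touches_auth:
        return "high"        # architectural or compliance-sensitive
    if change.lines_changed <= 100 and change.has_automated_rollback:
        return "standard"    # small, well-tested, reversible
    return "normal"          # medium scope: add a team lead sign-off

print(classify(Change(12, False, False, True)))   # -> standard
print(classify(Change(400, True, False, True)))   # -> high
```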
Step 2: Define pipeline controls that replace CAB review (Weeks 2-3)
For each concern the CAB currently addresses, implement an automated alternative:
| CAB concern | Automated replacement |
| --- | --- |
| “Will this change break something?” | Automated test suite with high coverage, pipeline-gated |
| “Is there a rollback plan?” | Automated rollback built into the deployment pipeline |
| “Has this been tested?” | Test results attached to every change as pipeline evidence |
| “Is this change authorized?” | Peer code review with approval recorded in version control |
| “Do we have an audit trail?” | Pipeline logs capture who changed what, when, with what test results |
Document these controls. They become the evidence that satisfies auditors in place of the CAB
meeting minutes.
Step 3: Pilot auto-approval for standard changes (Week 3)
Pick one team or one service as a pilot. Standard-risk changes from that team bypass the CAB
entirely if they meet the automated criteria:
- Code review approved by at least one peer.
- All pipeline stages passed (build, test, security scan).
- Change classified as standard risk.
- Deployment includes automated health checks and rollback capability.
Track the results: deployment frequency, change fail rate, and incident count. Compare with the
CAB-gated process.
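The auto-approval check itself can be a short script that runs as the final pipeline gate. A minimal sketch of the decision logic only; how you fetch review status and pipeline results is platform-specific, so they are passed in as plain values here:

```python
def auto_approve(risk_level: str,
                 peer_approvals: int,
                 pipeline_passed: bool,
                 has_health_checks: bool,
                 has_rollback: bool) -> bool:
    """Return True if a standard-risk change meets all automated criteria."""
    return (
        risk_level == "standard"
        and peer_approvals >= 1
        and pipeline_passed
        and has_health_checks
        and has_rollback
    )

# A reviewed, green, standard-risk change deploys without a CAB ticket.
assert auto_approve("standard", 1, True, True, True)
# A high-risk change still goes to human review.
assert not auto_approve("high", 2, True, True, True)
```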
Step 4: Present the data and expand (Weeks 4-8)
After a month of pilot data, present the results to the CAB and organizational leadership:
- How many changes were auto-approved?
- What was the change fail rate for auto-approved changes vs. CAB-reviewed changes?
- How much faster did auto-approved changes reach production?
- How many incidents were caused by auto-approved changes?
If the data shows that auto-approved changes are as safe or safer than CAB-reviewed changes
(which is the typical outcome), expand the auto-approval process to more teams and more change
types.
Step 5: Reduce the CAB to high-risk changes only (Week 8+)
With most changes flowing through automated approval, the CAB’s scope shrinks to genuinely
high-risk changes: major architectural shifts, compliance-sensitive changes, and cross-team
infrastructure modifications. These changes are infrequent enough that a review process is not
a bottleneck.
The CAB meeting frequency drops from weekly to as-needed. The board members spend their time on
changes that actually benefit from human review rather than rubber-stamping routine deployments.
| Objection | Response |
| --- | --- |
| “The CAB is required by our compliance framework” | Most compliance frameworks (SOX, PCI, HIPAA) require separation of duties and change control, not a specific meeting. Automated pipeline controls with audit trails satisfy the same requirements. Engage your auditors early to confirm. |
| “Without the CAB, anyone could deploy anything” | The pipeline controls are stricter than the CAB. The CAB reviews a form for five minutes. The pipeline runs thousands of tests, security scans, and verification checks. Auto-approval is not no-approval - it is better approval. |
| “We’ve always done it this way” | The CAB was designed for a world of monthly releases. In that world, reviewing 10 changes per month made sense. In a CD world with 10 changes per day, the same process becomes a bottleneck that adds risk instead of reducing it. |
| “What if an auto-approved change causes an incident?” | What if a CAB-approved change causes an incident? (They do.) The question is not whether incidents happen but how quickly you detect and recover. Automated deployment verification and rollback detect and recover faster than any manual process. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Lead time | Should decrease as CAB delay is removed for standard changes |
| Release frequency | Should increase as deployment is no longer gated on weekly meetings |
| Change fail rate | Should remain stable or decrease - proving auto-approval is safe |
| Percentage of changes auto-approved | Should climb toward 80-90% |
| CAB meeting frequency | Should decrease from weekly to as-needed |
| Time from “ready to deploy” to “deployed” | Should drop from days to hours or minutes |
Related Content
1.5.2 - Pressure to Skip Testing
Management pressures developers to skip or shortcut testing to meet deadlines. The test suite rots sprint by sprint as skipped tests become the norm.
Category: Organizational & Cultural | Quality Impact: High
What This Looks Like
A deadline is approaching. The manager asks the team how things are going. A developer says the
feature is done but the tests still need to be written. The manager says “we’ll come back to the
tests after the release.” The tests are never written. Next sprint, the same thing happens. After
a few months, the team has a codebase with patches of coverage surrounded by growing deserts of
untested code.
Nobody made a deliberate decision to abandon testing. It happened one shortcut at a time, each
one justified by a deadline that felt more urgent than the test suite.
Common variations:
- “Tests are a nice-to-have.” The team treats test writing as optional scope that gets cut
when time is short. Features are estimated without testing time. Tests are a separate backlog
item that never reaches the top.
- “We’ll add tests in the hardening sprint.” Testing is deferred to a future sprint dedicated
to quality. That sprint gets postponed, shortened, or filled with the next round of urgent
features. The testing debt compounds.
- “Just get it out the door.” A manager or product owner explicitly tells developers to skip
tests for a specific release. The implicit message is that shipping matters and quality does
not. Developers who push back are seen as slow or uncooperative.
- The coverage ratchet in reverse. The team once had 70% test coverage. Each sprint, a few
untested changes slip through. Coverage drops to 60%, then 50%, then 40%. Nobody notices the
trend because each individual drop is small. By the time someone looks at the number, half the
safety net is gone.
- Testing theater. Developers write the minimum tests needed to pass a coverage gate - trivial
assertions, tests that verify getters and setters, tests that do not actually exercise
meaningful behavior. The coverage number looks healthy but the tests catch nothing.
The telltale sign: the team has a backlog of “write tests for X” tickets that are months old and
have never been started, while production incidents keep increasing.
Why This Is a Problem
Skipping tests feels like it saves time in the moment. It does not. It borrows time from the
future at a steep interest rate. The effects are invisible at first and catastrophic later.
It reduces quality
Every untested change is a change that nobody can verify automatically. The first few skipped
tests are low risk - the code is fresh in the developer’s mind and unlikely to break. But as
weeks pass, the untested code is modified by other developers who do not know the original intent.
Without tests to pin the behavior, regressions creep in undetected.
The damage accelerates. When half the codebase is untested, developers cannot tell which changes
are safe and which are risky. They treat every change as potentially dangerous, which slows them
down. Or they treat every change as probably fine, which lets bugs through. Either way, quality
suffers.
Teams that maintain their test suite catch regressions within minutes of introducing them. The
developer who caused the regression fixes it immediately because they are still working on the
relevant code. The cost of the fix is minutes, not days.
It increases rework
Untested code generates rework in two forms. First, bugs that would have been caught by tests
reach production and must be investigated, diagnosed, and fixed under pressure. A bug found by a
test costs minutes to fix. The same bug found in production costs hours - plus the cost of
the incident response, the rollback or hotfix, and the customer impact.
Second, developers working in untested areas of the codebase move slowly because they have no
safety net. They make a change, manually verify it, discover it broke something else, revert,
try again. Work that should take an hour takes a day because every change requires manual
verification.
The rework is invisible in sprint metrics. The team does not track “time spent debugging issues
that tests would have caught.” But it shows up in velocity: the team ships less and less each
sprint even as they work longer hours.
It makes delivery timelines unpredictable
When the test suite is healthy, the time from “code complete” to “deployed” is a known quantity.
The pipeline runs, tests pass, the change ships. When the test suite has been hollowed out by
months of skipped tests, that step becomes unpredictable. Some changes pass cleanly. Others
trigger production incidents that take days to resolve.
The manager who pressured the team to skip tests in order to hit a deadline ends up with less
predictable timelines, not more. Each skipped test is a small increase in the probability that a
future change will cause an unexpected failure. Over months, the cumulative probability climbs
until production incidents become a regular occurrence rather than an exception.
Teams with comprehensive test suites deliver predictably because the automated checks eliminate
the largest source of variance - undetected defects.
It creates a death spiral
The most dangerous aspect of this anti-pattern is that it is self-reinforcing. Skipping tests
leads to more bugs. More bugs lead to more time spent firefighting. More time firefighting means
less time for testing. Less testing means more bugs. The cycle accelerates.
At the same time, the codebase becomes harder to test. Code written without tests in mind tends
to be tightly coupled, dependent on global state, and difficult to isolate. The longer testing is
deferred, the more expensive it becomes to add tests later. The team’s estimate for “catching up
on testing” grows from days to weeks to months, making it even less likely that management will
allocate the time.
Eventually, the team reaches a state where the test suite is so degraded that it provides no
confidence. The team is effectively back to no test automation
but with the added burden of maintaining a broken test infrastructure that nobody trusts.
Impact on continuous delivery
Continuous delivery requires automated quality gates that the team can rely on. A test suite that
has been eroded by months of skipped tests is not a quality gate - it is a gate with widening
holes. Changes pass through it not because they are safe but because the tests that would have
caught the problems were never written.
A team cannot deploy continuously if they cannot verify continuously. When the manager says “skip
the tests, we need to ship,” they are not just deferring quality work. They are dismantling the
infrastructure that makes frequent, safe deployment possible.
How to Fix It
Step 1: Make the cost visible (Week 1)
The pressure to skip tests comes from a belief that testing is overhead rather than investment.
Change that belief with data:
- Count production incidents in the last 90 days. For each one, identify whether an automated
test could have caught it. Calculate the total hours spent on incident response.
- Measure the team’s change fail rate - the percentage of deployments that cause a failure or
require a rollback.
- Track how long manual verification takes per release. Sum the hours across the team.
Present these numbers to the manager applying pressure. Frame it concretely: “We spent 40 hours
on incident response last quarter. Thirty of those hours went to incidents that the tests we
skipped would have caught.”
Step 2: Include testing in every estimate (Week 2)
Stop treating tests as separate work items that can be deferred:
- Agree as a team: no story is “done” until it has automated tests. This is a working agreement,
not a suggestion.
- Include testing time in every estimate. If a feature takes three days to build, the estimate is
three days - including tests. Testing is not additive; it is part of building the feature.
- Stop creating separate “write tests” tickets. Tests are part of the story, not a follow-up
task.
When a manager asks “can we skip the tests to ship faster?” the answer is “the tests are part of
shipping. Skipping them means the feature is not done.”
Step 3: Set a coverage floor and enforce it (Week 3)
Prevent further erosion with an automated guardrail:
- Measure current test coverage. Whatever it is - 30%, 50%, 70% - that is the floor.
- Configure the pipeline to fail if a change reduces coverage below the floor.
- Ratchet the floor up by 1-2 percentage points each month.
The floor makes the cost of skipping tests immediate and visible. A developer who skips tests
will see the pipeline fail. The conversation shifts from “we’ll add tests later” to “the pipeline
won’t let us merge without tests.”
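Many coverage tools can enforce a threshold natively; if yours cannot, a small pipeline script is enough. A sketch, assuming the current coverage percentage has already been extracted from your coverage report:

```python
import sys

COVERAGE_FLOOR = 52.0  # current floor; ratchet this up 1-2 points per month

def enforce_floor(current_coverage: float, floor: float = COVERAGE_FLOOR) -> None:
    """Fail the pipeline if coverage has dropped below the agreed floor."""
    if current_coverage < floor:
        print(f"Coverage {current_coverage:.1f}% is below the floor of {floor:.1f}%.")
        print("Add tests for the code you changed before merging.")
        sys.exit(1)
    print(f"Coverage {current_coverage:.1f}% meets the floor of {floor:.1f}%.")

if __name__ == "__main__":
    # The number comes from your coverage tool's report; reading that report
    # is tool-specific, so it is passed on the command line in this sketch.
    enforce_floor(float(sys.argv[1]))
```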
Step 4: Recover coverage in high-risk areas (Weeks 3-6)
You cannot test everything retroactively. Prioritize the areas that matter most:
- Use version control history to find the files with the most changes and the most bug fixes.
These are the highest-risk areas (see the sketch after this list).
- For each high-risk file, write tests for the core behavior - the functions that other code
depends on.
- Allocate a fixed percentage of each sprint (e.g., 20%) to writing tests for existing code.
This is not optional and not deferrable.
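A minimal sketch of that version-control analysis, using nothing but the git command line - counting how often each file appears in commits whose messages mention a fix is a rough but useful proxy for risk:

```python
import subprocess
from collections import Counter

def risky_files(since: str = "6 months ago", top: int = 20) -> list[tuple[str, int]]:
    """Rank files by how often they appear in bug-fix commits."""
    log = subprocess.run(
        ["git", "log", f"--since={since}", "--grep=fix", "-i",
         "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    ).stdout
    counts = Counter(line.strip() for line in log.splitlines() if line.strip())
    return counts.most_common(top)

if __name__ == "__main__":
    for path, fixes in risky_files():
        print(f"{fixes:4d}  {path}")
```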
Step 5: Address the management pressure directly (Ongoing)
The root cause is a manager who sees testing as optional. This requires a direct conversation:
| What the manager says | What to say back |
| --- | --- |
| “We don’t have time for tests” | “We don’t have time for the production incidents that skipping tests causes. Last quarter, incidents cost us X hours.” |
| “Just this once, we’ll catch up later” | “We said that three sprints ago. Coverage has dropped from 60% to 45%. There is no ‘later’ unless we stop the bleeding now.” |
| “The customer needs this feature by Friday” | “The customer also needs the application to work. Shipping an untested feature on Friday and a hotfix on Monday does not save time.” |
| “Other teams ship without this many tests” | “Other teams with similar practices have a change fail rate of X%. Ours is Y%. The tests are why.” |
If the manager continues to apply pressure after seeing the data, escalate. Test suite erosion is
a technical risk that affects the entire organization’s ability to deliver. It is appropriate to
raise it with engineering leadership.
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Test coverage trend | Should stop declining and begin climbing |
| Change fail rate | Should decrease as coverage recovers |
| Production incidents from untested code | Track root causes - “no test coverage” should become less frequent |
| Stories completed without tests | Should drop to zero |
| Development cycle time | Should stabilize as manual verification decreases |
| Sprint capacity spent on incident response | Should decrease as fewer untested changes reach production |
Related Content
1.6 - Monitoring and Observability
Anti-patterns in monitoring, alerting, and observability that block continuous delivery.
These anti-patterns affect the team’s ability to see what is happening in production. They
create blind spots that make deployment risky, incident response slow, and confidence in
the delivery pipeline impossible to build.
1.6.1 - No Observability
The team cannot tell if a deployment is healthy. No metrics, no log aggregation, no tracing. Issues are discovered when customers call support.
Category: Monitoring & Observability | Quality Impact: High
What This Looks Like
The team deploys a change. Someone asks “is it working?” Nobody knows. There is no dashboard to
check. There are no metrics to compare before and after. The team waits. If nobody complains
within an hour, they assume the deployment was successful.
When something does go wrong, the team finds out from a customer support ticket, a Slack message
from another team, or an executive asking why the site is slow. The investigation starts with
SSH-ing into a server and reading raw log files. Hours pass before anyone understands what
happened, what caused it, or how many users were affected.
Common variations:
- Logs exist but are not aggregated. Each server writes its own log files. Debugging requires
logging into multiple servers and running grep. Correlating a request across services means
opening terminals to five machines and searching by timestamp.
- Metrics exist but nobody watches them. A monitoring tool was set up once. It has default
dashboards for CPU and memory. Nobody configured application-level metrics. The dashboards show
that servers are running, not whether the application is working.
- Alerting is all or nothing. Either there are no alerts, or there are hundreds of noisy
alerts that the team ignores. Real problems are indistinguishable from false alarms. The
on-call person mutes their phone.
- Observability is someone else’s job. A separate operations or platform team owns the
monitoring tools. The development team does not have access, does not know what is monitored,
and does not add instrumentation to their code.
- Post-deployment verification is manual. After every deployment, someone clicks through the
application to check if it works. This takes 15 minutes per deployment. It catches obvious
failures but misses performance degradation, error rate increases, and partial outages.
The telltale sign: the team’s primary method for detecting production problems is waiting for
someone outside the team to report them.
Why This Is a Problem
Without observability, the team is deploying into a void. They cannot verify that deployments
are healthy, cannot detect problems quickly, and cannot diagnose issues when they arise. Every
deployment is a bet that nothing will go wrong, with no way to check.
It reduces quality
When the team cannot see the effects of their changes in production, they cannot learn from them.
A deployment that degrades response times by 200 milliseconds goes unnoticed. A change that
causes a 2% increase in error rates is invisible. These small quality regressions accumulate
because nobody can see them.
Without production telemetry, the team also loses the most valuable feedback loop: how the
software actually behaves under real load with real data. A test suite can verify logic, but only
production observability reveals performance characteristics, usage patterns, and failure modes
that tests cannot simulate.
Teams with strong observability catch regressions within minutes of deployment. They see error
rate spikes, latency increases, and anomalous behavior in real time. They roll back or fix the
issue before most users are affected. Quality improves because the feedback loop from deployment
to detection is minutes, not days.
It increases rework
Without observability, incidents take longer to detect, longer to diagnose, and longer to resolve.
Each phase of the incident lifecycle is extended because the team is working blind.
Detection takes hours or days instead of minutes because the team relies on external reports.
Diagnosis takes hours instead of minutes because there are no traces, no correlated logs, and no
metrics to narrow the search. The team resorts to reading code and guessing. Resolution takes
longer because without metrics, the team cannot verify that their fix actually worked - they
deploy the fix and wait to see if the complaints stop.
A team with observability detects problems in minutes through automated alerts, diagnoses them
in minutes by following traces and examining metrics, and verifies fixes instantly by watching
dashboards. The total incident lifecycle drops from hours to minutes.
It makes delivery timelines unpredictable
Without observability, the team cannot assess deployment risk. They do not know the current error
rate, the baseline response time, or the system’s capacity. Every deployment might trigger an
incident that consumes the rest of the day, or it might go smoothly. The team cannot predict
which.
This uncertainty makes the team cautious. They deploy less frequently because each deployment is
a potential fire. They avoid deploying on Fridays, before holidays, or before important events.
They batch up changes so there are fewer risky deployment moments. Each of these behaviors slows
delivery and increases batch size, which increases risk further.
Teams with observability deploy with confidence because they can verify health immediately. A
deployment that causes a problem is detected and rolled back in minutes. The blast radius is
small because the team catches issues before they spread. This confidence enables frequent
deployment, which keeps batch sizes small, which reduces risk.
Impact on continuous delivery
Continuous delivery requires fast feedback from production. The deploy-and-verify cycle must be
fast enough that the team can deploy many times per day with confidence. Without observability,
there is no verification step - only hope.
Specifically, CD requires:
- Automated deployment verification. After every deployment, the pipeline must verify that the
new version is healthy before routing traffic to it. This requires health checks, metric
comparisons, and automated rollback triggers - all of which require observability.
- Fast incident detection. If a deployment causes a problem, the team must know within
minutes, not hours. Automated alerts based on error rates, latency, and business metrics
are essential.
- Confident rollback decisions. When a deployment looks unhealthy, the team must be able to
compare current metrics to the baseline and make a data-driven rollback decision. Without
metrics, rollback decisions are based on gut feeling and anecdote.
A team without observability can automate deployment, but they cannot automate verification. That
means every deployment requires manual checking, which caps deployment frequency at whatever pace
the team can manually verify.
How to Fix It
Step 1: Add structured logging (Week 1)
Structured logging is the foundation of observability. Without it, logs are unreadable at scale.
- Replace unstructured log statements such as `log("processing order")` with structured ones such
as `log(event="order.processed", order_id=123, duration_ms=45)`.
- Include a correlation ID in every log entry so that all log entries for a single request can
be linked together across services.
- Send logs to a central aggregation service (Elasticsearch, Datadog, CloudWatch, Loki, or
similar). Stop relying on SSH and grep.
Focus on the most critical code paths first: request handling, error paths, and external service
calls. You do not need to instrument everything in week one.
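A minimal sketch of the difference, using only the Python standard library to emit one JSON object per log line (the field names are illustrative):

```python
import json
import logging
import sys
import time
import uuid

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("app")

def log_event(**fields) -> None:
    """Emit one structured, machine-parseable log line."""
    log.info(json.dumps(fields))

# Unstructured: hard to search, impossible to aggregate.
log.info("processing order")

# Structured: every field is queryable in the log aggregator.
correlation_id = str(uuid.uuid4())  # generated once per request, passed along
start = time.monotonic()
# ... handle the order ...
log_event(
    event="order.processed",
    order_id=123,
    correlation_id=correlation_id,
    duration_ms=round((time.monotonic() - start) * 1000),
)
```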
Step 2: Add application-level metrics (Week 2)
Infrastructure metrics (CPU, memory, disk) tell you the servers are running. Application metrics
tell you the software is working. Add the four golden signals:
| Signal | What to measure | Example |
| --- | --- | --- |
| Latency | How long requests take | p50, p95, p99 response time per endpoint |
| Traffic | How much demand the system handles | Requests per second, messages processed per minute |
| Errors | How often requests fail | Error rate by endpoint, HTTP 5xx count |
| Saturation | How full the system is | Queue depth, connection pool usage, thread count |
Expose these metrics through your application (using Prometheus client libraries, StatsD, or
your platform’s metric SDK) and visualize them on a dashboard.
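As one concrete option, the Prometheus Python client covers latency, traffic, and errors with a few lines of instrumentation. A hedged sketch - the metric names, labels, and endpoint are illustrative:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Latency: a duration histogram gives you p50/p95/p99 per endpoint.
REQUEST_SECONDS = Histogram("request_duration_seconds", "Request latency", ["endpoint"])
# Traffic and errors: a counter sliced by endpoint and status code.
REQUESTS = Counter("requests_total", "Requests handled", ["endpoint", "status"])

def handle_checkout() -> None:
    start = time.monotonic()
    status = "200" if random.random() > 0.02 else "500"  # stand-in for real work
    REQUEST_SECONDS.labels(endpoint="/checkout").observe(time.monotonic() - start)
    REQUESTS.labels(endpoint="/checkout", status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
    while True:
        handle_checkout()
        time.sleep(0.1)
```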
Step 3: Create a deployment health dashboard (Week 3)
Build a single dashboard that answers: “Is the system healthy right now?”
- Include the four golden signals from Step 2.
- Add deployment markers so the team can see when deploys happened and correlate them with
metric changes.
- Include business metrics that matter: successful checkouts per minute, sign-ups per hour,
or whatever your system’s key transactions are.
This dashboard becomes the first thing the team checks after every deployment. It replaces the
manual click-through verification.
Step 4: Add automated alerts for deployment verification (Week 4)
Move from “someone checks the dashboard” to “the system tells us when something is wrong”:
- Set alert thresholds based on your baseline metrics. If the p95 latency is normally 200ms,
alert when it exceeds 500ms for more than 2 minutes.
- Set error rate alerts. If the error rate is normally below 1%, alert when it crosses 5%.
- Connect alerts to the team’s communication channel (Slack, PagerDuty, or similar). Alerts
must reach the people who can act on them.
Start with a small number of high-confidence alerts. Three alerts that fire reliably are worth
more than thirty that the team ignores.
Step 5: Integrate observability into the deployment pipeline (Week 5+)
Close the loop between deployment and verification:
- After deploying, the pipeline waits and checks health metrics automatically. If error rates
spike or latency degrades beyond the threshold, the pipeline triggers an automatic rollback.
- Add smoke tests that run against the live deployment and report results to the dashboard.
- Implement canary deployments or progressive rollouts that route a small percentage of traffic
to the new version and compare its metrics against the baseline before promoting.
This is the point where observability enables continuous delivery. The pipeline can deploy with
confidence because it can verify health automatically.
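The verification step can start as a short script the pipeline runs after the deploy. A minimal sketch of the logic, assuming your metrics backend can answer “what is the error rate and p95 latency right now” - the query function below is a placeholder that returns dummy values:

```python
import sys
import time

ERROR_RATE_THRESHOLD = 0.05   # roll back if more than 5% of requests fail
LATENCY_THRESHOLD_MS = 500    # roll back if p95 latency exceeds this budget
CHECK_WINDOW_SECONDS = 300    # watch the new version for five minutes

def current_metrics() -> tuple[float, float]:
    """Placeholder: replace with a real query to Prometheus, Datadog, CloudWatch, etc."""
    return 0.01, 220.0  # (error rate, p95 latency in ms) - dummy healthy values

def verify_deployment() -> bool:
    """Return True only if the new version stays healthy for the whole window."""
    deadline = time.time() + CHECK_WINDOW_SECONDS
    while time.time() < deadline:
        error_rate, p95_ms = current_metrics()
        if error_rate > ERROR_RATE_THRESHOLD or p95_ms > LATENCY_THRESHOLD_MS:
            return False  # unhealthy: the pipeline triggers the rollback
        time.sleep(30)
    return True

if __name__ == "__main__":
    sys.exit(0 if verify_deployment() else 1)  # non-zero exit fails the pipeline stage
```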
| Objection | Response |
| --- | --- |
| “We don’t have budget for monitoring tools” | Open-source stacks (Prometheus, Grafana, Loki, Jaeger) provide full observability at zero license cost. The investment is setup time, not money. |
| “We don’t have time to add instrumentation” | Start with the deployment health dashboard. One afternoon of work gives the team more production visibility than they have ever had. Build from there. |
| “The ops team handles monitoring” | Observability is a development concern, not just an operations concern. Developers write the code that generates the telemetry. They need access to the dashboards and alerts. |
| “We’ll add observability after we stabilize” | You cannot stabilize what you cannot see. Observability is how you find stability problems. Adding it later means flying blind longer. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Mean time to detect (MTTD) | Time from problem occurring to team being aware - should drop from hours to minutes |
| Mean time to repair | Should decrease as diagnosis becomes faster |
| Manual verification time per deployment | Should drop to zero as automated checks replace manual click-throughs |
| Change fail rate | Should decrease as deployment verification catches problems before they reach users |
| Alert noise ratio | Percentage of alerts that are actionable - should be above 80% |
| Incidents discovered by customers vs. by the team | Ratio should shift toward team detection |
Related Content
1.7 - Architecture
Anti-patterns in system architecture and design that block continuous delivery.
These anti-patterns affect the structure of the software itself. They create coupling that
makes independent deployment impossible, blast radii that make every change risky, and
boundaries that force teams to coordinate instead of delivering independently.
1.7.1 - Tightly Coupled Monolith
Changing one module breaks others. No clear boundaries. Every change is high-risk because blast radius is unpredictable.
Category: Architecture | Quality Impact: High
What This Looks Like
A developer changes a function in the order processing module. The test suite fails in the
reporting module, the notification service, and a batch job that nobody knew existed. The
developer did not touch any of those systems. They changed one function in one file, and three
unrelated features broke.
The team has learned to be cautious. Before making any change, developers trace every caller,
every import, and every database query that might be affected. A change that should take an hour
takes a day because most of the time is spent figuring out what might break. Even after that
analysis, surprises are common.
Common variations:
- The web of shared state. Multiple modules read and write the same database tables directly.
A schema change in one module breaks queries in five others. Nobody owns the tables because
everybody uses them.
- The god object. A single class or module that everything depends on. It handles
authentication, logging, database access, and business logic. Changing it is terrifying because
the entire application runs through it.
- Transitive dependency chains. Module A depends on Module B, which depends on Module C. A
change to Module C breaks Module A through a chain that nobody can trace without a debugger.
The dependency graph is a tangle, not a tree.
- Shared libraries with hidden contracts. Internal libraries used by multiple modules with no
versioning or API stability guarantees. Updating the library for one consumer breaks another.
Teams stop updating shared libraries because the risk is too high.
- Everything deploys together. The application is a single deployable unit. Even if modules
are logically separated in the source code, they compile and ship as one artifact. A one-line
change to the login page requires deploying the entire system.
The telltale sign: developers regularly say “I don’t know what this change will affect” and
mean it. Changes routinely break features that seem unrelated.
Why This Is a Problem
Tight coupling turns every change into a gamble. The cost of a change is not proportional to its
size but to the number of hidden dependencies it touches. Small changes carry large risk, which
slows everything down.
It reduces quality
When every change can break anything, developers cannot reason about the impact of their work.
A well-bounded module lets a developer think locally: “I changed the discount calculation, so
discount-related behavior might be affected.” A tightly coupled system offers no such guarantee.
The discount calculation might share a database table with the shipping module, which triggers
a notification workflow, which updates a dashboard.
This unpredictable blast radius makes code review less effective. Reviewers can verify that the
code in the diff is correct, but they cannot verify that it is safe. The breakage happens in code
that is not in the diff - code that neither the author nor the reviewer thought to check.
In a system with clear module boundaries, the blast radius of a change is bounded by the module’s
interface. If the interface does not change, nothing outside the module can break. Developers and
reviewers can focus on the module itself and trust the boundary.
It increases rework
Tight coupling causes rework in two ways. First, unexpected breakage from seemingly safe changes
sends developers back to fix things they did not intend to touch. A one-line change that breaks
the notification system means the developer now needs to understand and fix the notification
system before their original change can ship.
Second, developers working in different parts of the codebase step on each other. Two developers
changing different modules unknowingly modify the same shared state. Both changes work
individually but conflict when merged. The merge succeeds at the code level but fails at runtime
because the shared state cannot satisfy both changes simultaneously. These bugs are expensive to
find because the failure only manifests when both changes are present.
Systems with clear boundaries minimize this interference. Each module owns its data and exposes
it through explicit interfaces. Two developers working in different modules cannot create a
hidden conflict because there is no shared mutable state to conflict on.
It makes delivery timelines unpredictable
In a coupled system, the time to deliver a change includes the time to understand the impact,
make the change, fix the unexpected breakage, and retest everything that might be affected. The
first and third steps are unpredictable because no one knows the full dependency graph.
A developer estimates a task at two days. On day one, the change is made and tests are passing.
On day two, a failing test in another module reveals a hidden dependency. Fixing the dependency
takes two more days. The task that was estimated at two days takes four. This happens often enough
that the team stops trusting estimates, and stakeholders stop trusting timelines.
The testing cost is also unpredictable. In a modular system, changing Module A means running
Module A’s tests. In a coupled system, changing anything might mean running everything. If the
full test suite takes 30 minutes, every small change requires a 30-minute feedback cycle because
there is no way to scope the impact.
It prevents independent team ownership
When the codebase is a tangle of dependencies, no team can own a module cleanly. Every change in
one team’s area risks breaking another team’s area. Teams develop informal coordination rituals:
“Let us know before you change the order table.” “Don’t touch the shared utils module without
talking to Platform first.”
These coordination costs scale quadratically with the number of teams. Two teams need one
communication channel. Five teams need ten. Ten teams need forty-five. The result is that adding
developers makes the system slower to change, not faster.
In a system with well-defined module boundaries, each team owns their modules and their data.
They deploy independently. They do not need to coordinate on internal changes because the
boundaries prevent cross-module breakage. Communication focuses on interface changes, which are
infrequent and explicit.
Impact on continuous delivery
Continuous delivery requires that any change can flow from commit to production safely and
quickly. Tight coupling breaks this in multiple ways:
- Blast radius prevents small, safe changes. If a one-line change can break unrelated
features, no change is small from a risk perspective. The team compensates by batching changes
and testing extensively, which is the opposite of continuous.
- Testing scope is unbounded. Without module boundaries, there is no way to scope testing to
the changed area. Every change requires running the full suite, which slows the pipeline and
reduces deployment frequency.
- Independent deployment is impossible. If everything must deploy together, deployment
coordination is required. Teams queue up behind each other. Deployment frequency is limited by
the slowest team.
- Rollback is risky. Rolling back one change might break something else if other changes
were deployed simultaneously. The tangle works in both directions.
A team with a tightly coupled monolith can still practice CD, but they must invest in decoupling
first. Without boundaries, the feedback loops are too slow and the blast radius is too large for
continuous deployment to be safe.
How to Fix It
Decoupling a monolith is a long-term effort. The goal is not to rewrite the system or extract
microservices on day one. The goal is to create boundaries that limit blast radius and enable
independent change. Start where the pain is greatest.
Step 1: Map the dependency hotspots (Week 1)
Identify the areas of the codebase where coupling causes the most pain:
- Use version control history to find the files that change together most frequently. Files that
always change as a group are likely coupled.
- List the modules or components that are most often involved in unexpected test failures after
changes to other areas.
- Identify shared database tables - tables that are read or written by more than one module.
- Draw the dependency graph. Tools like dependency-cruiser (JavaScript), jdepend (Java), or
similar can automate this. Look for cycles and high fan-in nodes.
Rank the hotspots by pain: which coupling causes the most unexpected breakage, the most
coordination overhead, or the most test failures?
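A hedged sketch of the co-change analysis from the first bullet above: count how often pairs of files change in the same commit. Pairs with high counts that cross what you consider separate modules are your coupling hotspots (git is the only dependency):

```python
import subprocess
from collections import Counter
from itertools import combinations

def co_change_pairs(since: str = "6 months ago", top: int = 20):
    """Rank file pairs by how often they change in the same commit."""
    log = subprocess.run(
        ["git", "log", f"--since={since}", "--name-only", "--pretty=format:--"],
        capture_output=True, text=True, check=True,
    ).stdout
    pairs: Counter = Counter()
    commit_files: list[str] = []
    for line in log.splitlines():
        if line == "--":                  # marker emitted once per commit
            pairs.update(combinations(sorted(set(commit_files)), 2))
            commit_files = []
        elif line.strip():
            commit_files.append(line.strip())
    pairs.update(combinations(sorted(set(commit_files)), 2))  # last commit
    return pairs.most_common(top)

if __name__ == "__main__":
    for (a, b), count in co_change_pairs():
        print(f"{count:4d}  {a}  <->  {b}")
```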
Step 2: Define module boundaries on paper (Week 2)
Before changing any code, define where boundaries should be:
- Group related functionality into candidate modules based on business domain, not technical
layer. “Orders,” “Payments,” and “Notifications” are better boundaries than “Database,”
“API,” and “UI.”
- For each boundary, define what the public interface would be: what data crosses the boundary
and in what format?
- Identify shared state that would need to be split or accessed through interfaces.
This is a design exercise, not an implementation. The output is a diagram showing target module
boundaries with their interfaces.
Step 3: Enforce one boundary (Weeks 3-6)
Pick the boundary with the best ratio of pain-reduced to effort-required and enforce it in code:
- Create an explicit interface (API, function contract, or event) for cross-module communication.
All external callers must use the interface.
- Move shared database access behind the interface. If the payments module needs order data, it
calls the orders module’s interface rather than querying the orders table directly.
- Add a build-time or lint-time check that enforces the boundary. Fail the build if code outside
the module imports internal code directly.
This is the hardest step because it requires changing existing call sites. Use the Strangler Fig
approach: create the new interface alongside the old coupling, migrate callers one at a time, and
remove the old path when all callers have migrated.
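The build-time boundary check can start as a few lines of static analysis run in the pipeline. A minimal sketch using Python's ast module - the module names and the “internal” convention are assumptions you would replace with your own layout:

```python
import ast
import pathlib
import sys

# Assumption for this sketch: code under orders/internal/ may only be imported
# from within the orders module itself; everyone else must go through orders.api.
PROTECTED_PREFIX = "orders.internal"
ALLOWED_OWNER = "orders"

def boundary_violations(root: str = ".") -> list[str]:
    violations = []
    for path in pathlib.Path(root).rglob("*.py"):
        owner = path.parts[0] if path.parts else ""
        if owner == ALLOWED_OWNER:
            continue  # the module may use its own internals
        tree = ast.parse(path.read_text(), filename=str(path))
        for node in ast.walk(tree):
            names = []
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module]
            for name in names:
                if name.startswith(PROTECTED_PREFIX):
                    violations.append(f"{path}: imports {name}")
    return violations

if __name__ == "__main__":
    problems = boundary_violations()
    for problem in problems:
        print(problem)
    sys.exit(1 if problems else 0)  # any violation fails the build
```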
Step 4: Scope testing to module boundaries (Week 4+)
Once a boundary exists, use it to scope testing:
- Write tests for the module’s public interface (contract tests and functional tests).
- Changes within the module only need to run the module’s own tests plus the interface tests.
If the interface tests pass, nothing outside the module can break.
- Reserve the full integration suite for deployment validation, not developer feedback.
This immediately reduces pipeline duration for changes inside the bounded module. Developers get
faster feedback. The pipeline is no longer “run everything for every change.”
Step 5: Repeat for the next boundary (Ongoing)
Each new boundary reduces blast radius, improves test scoping, and enables more independent
ownership. Prioritize by pain:
| Signal | What it tells you |
| --- | --- |
| Files that always change together across modules | Coupling that forces coordinated changes |
| Unexpected test failures after unrelated changes | Hidden dependencies through shared state |
| Multiple teams needing to coordinate on changes | Ownership boundaries that do not match code boundaries |
| Long pipeline duration from running all tests | No way to scope testing because boundaries do not exist |
Over months, the system evolves from a tangle into a set of modules with defined interfaces. This
is not a rewrite. It is incremental boundary enforcement applied where it matters most.
| Objection | Response |
| --- | --- |
| “We should just rewrite it as microservices” | A rewrite takes months or years and delivers zero value until it is finished. Enforcing boundaries in the existing codebase delivers value with each boundary and does not require a big-bang migration. |
| “We don’t have time to refactor” | You are already paying the cost of coupling in unexpected breakage, slow testing, and coordination overhead. Each boundary you enforce reduces that ongoing cost. |
| “The coupling is too deep to untangle” | Start with the easiest boundary, not the hardest. Even one well-enforced boundary reduces blast radius and proves the approach works. |
| “Module boundaries will slow us down” | Boundaries add a small cost to cross-module changes and remove a large cost from within-module changes. Since most changes are within a module, the net effect is faster delivery. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Unexpected cross-module test failures | Should decrease as boundaries are enforced |
| Change fail rate | Should decrease as blast radius shrinks |
| Build duration | Should decrease as testing can be scoped to affected modules |
| Development cycle time | Should decrease as developers spend less time tracing dependencies |
| Cross-team coordination requests per sprint | Should decrease as module ownership becomes clearer |
| Files changed per commit | Should decrease as changes become more localized |
Related Content
2 - Migrating Brownfield to CD
Already have a running system? A phased approach to migrating existing applications and teams to continuous delivery.
Most teams adopting CD are not starting from scratch. They have existing codebases, existing
processes, existing habits, and existing pain. This section provides the phased migration path
from where you are today to continuous delivery, without stopping feature delivery along the way.
The Reality of Brownfield Migration
Migrating an existing system to CD is harder than building CD into a greenfield project. You are
working against inertia: existing branching strategies, existing test suites (or lack thereof),
existing deployment processes, and existing team habits. Every change has to be made incrementally,
alongside regular delivery work.
The good news: every team that has successfully adopted CD has done it this way. The practices in
this guide are designed for incremental adoption, not big-bang transformation.
The Migration Phases
The migration is organized into five phases. Each phase builds on the previous one. Start with
Phase 0 to understand where you are, then work through the phases in order.
| Phase | Name | Goal | Key Question |
| --- | --- | --- | --- |
| 0 | Assess | Understand where you are | “How far are we from CD?” |
| 1 | Foundations | Daily integration, testing, small work | “Can we integrate safely every day?” |
| 2 | Pipeline | Automated path to production | “Can we deploy any commit automatically?” |
| 3 | Optimize | Improve flow, reduce batch size | “Can we deliver small changes quickly?” |
| 4 | Deliver on Demand | Deploy any change when needed | “Can we deliver any change to production when needed?” |
Where to Start
If you don’t know where you stand
Start with Phase 0 - Assess. Complete the value stream mapping exercise, take
baseline metrics, and fill out the current-state checklist. These activities tell you exactly
where you stand and which phase to begin with.
If you know your biggest pain point
Start with Anti-Patterns. Find the problem your team feels most, and follow the
links to the practices and migration phases that address it.
Quick self-assessment
If you don’t have time for a full assessment, answer these questions:
- Do all developers integrate to trunk at least daily? If no, start with
Phase 1.
- Do you have a single automated pipeline that every change goes through? If no, start with
Phase 2.
- Can you deploy any green build to production on demand? If no, focus on the gap between
your current state and Phase 2 completion criteria.
- Do you deploy at least weekly? If no, look at Phase 3 for batch size and
flow optimization.
Principles for Brownfield Migration
Do not stop delivering features
The migration is done alongside regular delivery work, not instead of it. Each practice is adopted
incrementally. You do not stop the world to rewrite your test suite or redesign your pipeline.
Fix the biggest constraint first
Use your value stream map and metrics to identify which blocker is the current constraint. Fix
that one thing. Then find the next constraint and fix that. Do not try to fix everything at once.
See Identify Constraints and the
CD Dependency Tree.
Make progress visible
Track your DORA metrics from day one: deployment frequency, lead time for changes, change failure
rate, and mean time to restore. These metrics show whether your changes are working and build the
case for continued investment.
See Baseline Metrics.
Start with one team
CD adoption works best when a single team can experiment, learn, and iterate without waiting for
organizational consensus. Once one team demonstrates results, other teams have a concrete example
to follow.
Common Brownfield Challenges
These challenges are specific to migrating existing systems. For the full catalog of problems
teams face, see Anti-Patterns.
| Challenge | Why it’s hard | Approach |
| --- | --- | --- |
| Large codebase with no tests | Writing tests retroactively is expensive and the ROI feels unclear | Do not try to add tests to the whole codebase. Add tests to every file you touch. Use the test-for-every-bug-fix rule. Coverage grows where it matters most. |
| Long-lived feature branches | The team has been using feature branches for years and the workflow feels safe | Reduce branch lifetime gradually: from two weeks to one week to two days to same-day. Do not switch to trunk overnight. |
| Manual deployment process | The “deployment expert” has a 50-step runbook in their head | Document the manual process first. Then automate one step at a time, starting with the most error-prone step. |
| Flaky test suite | Tests that randomly fail have trained the team to ignore failures | Quarantine all flaky tests immediately. They do not block the build until they are fixed. Zero tolerance for new flaky tests. |
| Tightly coupled architecture | Changing one module breaks others unpredictably | You do not need microservices. You need clear boundaries. Start by identifying and enforcing module boundaries within the monolith. |
| Organizational resistance | “We’ve always done it this way” | Start small, show results, build the case with data. One team deploying daily with lower failure rates is more persuasive than any slide deck. |
Migration Timeline
These ranges assume a single team working on the migration alongside regular delivery work:
| Phase | Typical Duration | Biggest Variable |
| --- | --- | --- |
| Phase 0 - Assess | 1-2 weeks | None - just do it |
| Phase 1 - Foundations | 1-6 months | Current testing and trunk-based development maturity |
| Phase 2 - Pipeline | 1-3 months | Complexity of existing deployment process |
| Phase 3 - Optimize | 2-6 months | Organizational willingness to change batch size and approval processes |
| Phase 4 - Deliver on Demand | 1-3 months | Confidence in pipeline and rollback capability |
Do not treat these timelines as commitments. The migration is an iterative improvement process,
not a project with a deadline.
Related Content
2.1 - Document Your Current Process
Before formal value stream mapping, get the team to write down every step from “ready to push” to “running in production.” Quick wins surface immediately; the documented process becomes better input for the value stream mapping session.
The Brownfield CD overview covers the migration phases, principles, and common challenges.
This page covers the first practical step - documenting what actually happens today between a
developer finishing a change and that change running in production.
Why Document Before Mapping
Value stream mapping is a powerful tool for systemic improvement. It requires measurement, cross-team
coordination, and careful analysis. That takes time to do well, and it should not be rushed.
But you do not need a value stream map to spot obvious friction. Manual steps that could be
automated, wait times caused by batching, handoffs that exist only because of process - these
are visible the moment you write the process down.
Document your current process first. This gives you two things:
- Quick wins you can fix this week. Obvious waste that requires no measurement or
cross-team coordination to remove.
- Better input for value stream mapping. When you do the formal mapping session, the team
is not starting from a blank whiteboard. They have a shared, written description of what
actually happens, and they have already removed the most obvious friction.
Quick wins build momentum. Teams that see immediate improvements are more willing to invest in
the deeper systemic work that value stream mapping reveals.
How to Do It
Get the team together. Pick a recent change that went through the full process from “ready to
push” to “running in production.” Walk through every step that happened, in order.
The rules:
- Document what actually happens, not what should happen. If the official process says
“automated deployment” but someone actually SSHes into a server and runs a script, write
down the SSH step.
- Include the invisible steps. The Slack message asking for review. The email requesting
deploy approval. The wait for the Tuesday deploy window. These are often the biggest sources
of delay and they are usually missing from official process documentation.
- Get the whole team in the room. Different people see different parts of the process. The
developer who writes the code may not know what happens after the merge. The ops person who
runs the deploy may not know about the QA handoff. You need every perspective.
- Write it down as an ordered list. Not a flowchart, not a diagram, not a wiki page with
sections. A simple numbered list of steps in the order they actually happen.
What to Capture for Each Step
For every step in the process, capture these details:
| Field | What to Write | Example |
| --- | --- | --- |
| Step name | What happens, in plain language | “QA runs manual regression tests” |
| Who does it | Person or role responsible | “QA engineer on rotation” |
| Manual or automated | Is this step done by a human or by a tool? | “Manual” |
| Typical duration | How long the step itself takes | “4 hours” |
| Wait time before it starts | How long the change sits before this step begins | “1-2 days (waits for QA availability)” |
| What can go wrong | Common failure modes for this step | “Tests find a bug, change goes back to dev” |
The wait time column is usually more revealing than the duration column. A deploy that takes 10
minutes but only happens on Tuesdays has up to 7 days of wait time. The step itself is not the
problem - the batching is.
Example: A Typical Brownfield Process
This is a realistic example of what a brownfield team’s process might look like before any CD
practices are adopted. Your process will differ, but the pattern of manual steps and wait times
is common.
| # | Step | Who | Manual/Auto | Duration | Wait Before | What Can Go Wrong |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Push to feature branch | Developer | Manual | Minutes | None | Merge conflicts with other branches |
| 2 | Open pull request | Developer | Manual | 10 min | None | Forgot to update tests |
| 3 | Wait for code review | Developer (waiting) | Manual | - | 4 hours to 2 days | Reviewer is busy, PR sits |
| 4 | Address review feedback | Developer | Manual | 30 min to 2 hours | - | Multiple rounds of feedback |
| 5 | Merge to main branch | Developer | Manual | Minutes | - | Merge conflicts from stale branch |
| 6 | CI runs (build + unit tests) | CI server | Automated | 15 min | Minutes | Flaky tests cause false failures |
| 7 | QA picks up ticket from board | QA engineer | Manual | - | 1-3 days | QA backlog, other priorities |
| 8 | Manual functional testing | QA engineer | Manual | 2-4 hours | - | Finds bug, sends back to dev |
| 9 | Request deploy approval | Team lead | Manual | 5 min | - | Approver is on vacation |
| 10 | Wait for deploy window | Everyone (waiting) | - | - | 1-7 days (deploys on Tuesdays) | Window missed, wait another week |
| 11 | Ops runs deployment | Ops engineer | Manual | 30 min | - | Script fails, manual rollback |
| 12 | Smoke test in production | Ops engineer | Manual | 15 min | - | Finds issue, emergency rollback |
Total typical time: 3 to 14 days from “ready to push” to “running in production.”
Even before measurement or analysis, patterns jump out:
- Steps 3, 7, and 10 are pure wait time - nothing is happening to the change.
- Steps 8 and 12 are manual testing that could potentially be automated.
- Step 10 is artificial batching - deploys happen on a schedule, not on demand.
- Step 9 might be a rubber-stamp approval that adds delay without adding safety.
Spotting Quick Wins
Once the process is documented, look for these patterns. Each one is a potential quick win that
the team can fix without a formal improvement initiative.
Automation targets
Steps that are purely manual but have well-known automation:
- Code formatting and linting. If reviewers spend time on style issues, add a linter to CI.
This saves reviewer time on every single PR.
- Running tests. If someone manually runs tests before merging, make CI run them
automatically on every push.
- Build and package. If someone manually builds artifacts, automate the build in the
pipeline.
- Smoke tests. If someone manually clicks through the app after deploy, write a small set
of automated smoke tests.
Batching delays
Steps where changes wait for a scheduled event:
- Deploy windows. “We deploy on Tuesdays” means every change waits an average of 3.5 days.
Moving to deploy-on-demand (even if still manual) removes this wait entirely.
- QA batches. “QA tests the release candidate” means changes queue up. Testing each change
as it merges removes the batch.
- CAB meetings. “The change advisory board meets on Thursdays” adds up to a week of wait
time per change.
Process-only handoffs
Steps where work moves between people not because of a skill requirement, but because of
process:
- QA sign-off that is a rubber stamp. If QA always approves and never finds issues, the
sign-off is not adding value.
- Approval steps that are never rejected. Track the rejection rate. If an approval step
has a 0% rejection rate over the last 6 months, it is ceremony, not a gate.
- Handoffs between people who sit next to each other. If the developer could do the step
themselves but “process says” someone else has to, question the process.
Unnecessary steps
Steps that exist because of historical reasons and no longer serve a purpose:
- Manual steps that duplicate automated checks. If CI runs the tests and someone also runs
them manually “just to be sure,” the manual run is waste.
- Approvals for low-risk changes. Not every change needs the same level of scrutiny. A
typo fix in documentation does not need a CAB review.
Quick Wins vs. Value Stream Improvements
Not everything you find in the documented process is a quick win. Distinguish between the two:
|   | Quick Wins | Value Stream Improvements |
| --- | --- | --- |
| Scope | Single team can fix | Requires cross-team coordination |
| Timeline | Days to a week | Weeks to months |
| Measurement | Obvious before/after | Requires baseline metrics and tracking |
| Risk | Low - small, reversible changes | Higher - systemic process changes |
| Examples | Add linter to CI, remove rubber-stamp approval, enable on-demand deploys | Restructure testing strategy, redesign deployment pipeline, change team topology |
Do the quick wins now. Do not wait for the value stream mapping session. Every manual step
you remove this week is one less step cluttering the value stream map and one less source of
friction for the team.
Bring the documented process to the value stream mapping session. The team has already
aligned on what actually happens, removed the obvious waste, and built some momentum. The value
stream mapping session can focus on the systemic issues that require measurement, cross-team
coordination, and deeper analysis.
What Comes Next
- Fix the quick wins. Assign each one to someone with a target of this week or next week.
Do not create a backlog of improvements that sits untouched.
- Schedule the value stream mapping session. Use the documented process as the starting
point. See Value Stream Mapping.
- Start the replacement cycle. For manual validations that are not quick wins, use the
Replacing Manual Validations cycle to systematically
automate and remove them.
Related Content
2.2 - Replacing Manual Validations with Automation
The repeating mechanical cycle at the heart of every brownfield CD migration: identify a manual validation, automate it, prove the automation works, and remove the manual step.
The Brownfield CD overview covers the migration phases, principles, and common challenges.
This page covers the core mechanical process - the specific, repeating cycle of replacing
manual validations with automation that drives every phase forward.
The Replacement Cycle
Every brownfield CD migration follows the same four-step cycle, repeated until no manual
validations remain between commit and production:
- Identify a manual validation in the delivery process.
- Automate the check so it runs in the pipeline without human intervention.
- Validate that the automation catches the same problems the manual step caught.
- Remove the manual step from the process.
Then pick the next manual validation and repeat.
Two rules make this cycle work:
- Do not skip “validate.” Run the manual and automated checks in parallel long enough to
prove the automation catches what the manual step caught. Without this evidence, the team will
not trust the automation, and the manual step will creep back.
- Do not skip “remove.” Keeping both the manual and automated checks adds cost without
removing any. The goal is replacement, not duplication. Once the automated check is proven,
retire the manual step explicitly.
Inventory Your Manual Validations
Before you can replace manual validations, you need to know what they are. A
value stream map is the fastest way to find them. Walk the
path from commit to production and mark every point where a human has to inspect, approve, verify,
or execute something before the change can move forward.
Common manual validations and where they typically live:
| Manual Validation | Where It Lives | What It Catches |
| --- | --- | --- |
| Manual regression testing | QA team runs test cases before release | Functional regressions in existing features |
| Code style review | PR review checklist | Formatting, naming, structural consistency |
| Security review | Security team sign-off before deploy | Vulnerable dependencies, injection risks, auth gaps |
| Environment configuration | Ops team configures target environment | Missing env vars, wrong connection strings, incorrect feature flags |
| Smoke testing | Someone clicks through the app after deploy | Deployment-specific failures, broken integrations |
| Change advisory board | CAB meeting approves production changes | Risk assessment, change coordination, rollback planning |
| Database migration review | DBA reviews and runs migration scripts | Schema conflicts, data loss, performance regressions |
Your inventory will include items not on this list. That is expected. The list above covers the
most common ones, but every team has process-specific manual steps that accumulated over time.
Prioritize by Effort and Friction
Not all manual validations are equal. Some cause significant delay on every release. Others are
quick and infrequent. Prioritize by mapping each validation on two axes:
Friction (vertical axis - how much pain the manual step causes):
- How often does it run? (every commit, every release, quarterly)
- How long does it take? (minutes, hours, days)
- How often does it produce errors? (rarely, sometimes, frequently)
High-frequency, long-duration, error-prone validations cause the most friction.
Effort to automate (horizontal axis - how hard the automation is to build):
- Is the codebase ready? (clean interfaces vs. tightly coupled)
- Do tools exist? (linters, test frameworks, scanning tools)
- Is the validation well-defined? (clear pass/fail vs. subjective judgment)
Start with high-friction, low-effort validations. These give you the fastest return and build
momentum for harder automations later. This is the same constraint-based thinking described in
Identify Constraints - fix the biggest bottleneck first.
|   | Low Effort | High Effort |
| --- | --- | --- |
| High Friction | Start here - fastest return | Plan these - high value but need investment |
| Low Friction | Do these opportunistically | Defer - low return for high cost |
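If it helps to make the prioritization concrete, the sketch below scores a hypothetical inventory and sorts it by friction per unit of automation effort. The validation names and scores are illustrative, not a prescription - substitute your own inventory and estimates.

```python
# Hypothetical sketch: rank manual validations by friction vs. effort to automate.
# Names and scores are illustrative - substitute your own inventory and estimates.

validations = [
    # (name, friction 1-10, effort to automate 1-10)
    ("Manual regression testing",  9, 6),
    ("Code style review",          6, 2),
    ("Smoke testing after deploy", 7, 3),
    ("Change advisory board",      8, 8),
    ("Database migration review",  5, 7),
]

def priority(item):
    _, friction, effort = item
    # Highest friction per unit of automation effort first: fastest return.
    return friction / effort

for name, friction, effort in sorted(validations, key=priority, reverse=True):
    print(f"{name:<30} friction={friction} effort={effort} ratio={friction / effort:.1f}")
```

The output puts high-friction, low-effort validations at the top of the backlog and pushes high-effort items toward the “plan these” quadrant, mirroring the matrix above.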
Walkthrough: Replacing Manual Regression Testing
A concrete example of the full cycle applied to a common brownfield problem.
Starting state
The QA team runs 200 manual test cases before every release. The full regression suite takes three
days. Releases happen every two weeks, so the team spends roughly 20% of every sprint on manual
regression testing.
Step 1: Identify
The value stream map shows the 3-day manual regression cycle as the single largest wait time
between “code complete” and “deployed.” This is the constraint.
Step 2: Automate (start small)
Do not attempt to automate all 200 test cases at once. Rank the test cases by two criteria:
- Failure frequency: Which tests actually catch bugs? (In most suites, a small number of
tests catch the majority of real regressions.)
- Business criticality: Which tests cover the highest-risk functionality?
Pick the top 20 test cases by these criteria. Write automated tests for those 20 first. This is
enough to start the validation step.
Step 3: Validate (parallel run)
Run the 20 automated tests alongside the full manual regression suite for two or three release
cycles. Compare results:
- Did the automated tests catch the same failures the manual tests caught?
- Did the automated tests miss anything the manual tests caught?
- Did the automated tests catch anything the manual tests missed?
Track these results explicitly. They are the evidence the team needs to trust the automation.
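One lightweight way to keep that evidence is to record which defects each approach caught in a given cycle and compare the sets. A minimal sketch, with hypothetical defect IDs:

```python
# Hypothetical sketch: compare what the manual and automated checks caught
# during one parallel-run cycle. Defect IDs are illustrative.

manual_found    = {"BUG-101", "BUG-104", "BUG-107"}
automated_found = {"BUG-101", "BUG-104", "BUG-110"}

caught_by_both        = manual_found & automated_found
missed_by_automation  = manual_found - automated_found   # gaps to close before removing the manual step
extra_from_automation = automated_found - manual_found   # issues the manual suite never saw

print("Caught by both:       ", sorted(caught_by_both))
print("Missed by automation: ", sorted(missed_by_automation))
print("Extra from automation:", sorted(extra_from_automation))
```

An empty “missed by automation” set across several cycles is the signal that the manual step is safe to retire.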
Step 4: Remove
Once the automated tests have proven equivalent for those 20 test cases across multiple cycles,
remove those 20 test cases from the manual regression suite. The manual suite is now 180 test
cases - taking roughly 2.7 days instead of 3.
Repeat
Pick the next 20 highest-value test cases. Automate them. Validate with parallel runs. Remove the
manual cases. The manual suite shrinks with each cycle:
| Cycle | Manual Test Cases | Manual Duration | Automated Tests |
| --- | --- | --- | --- |
| Start | 200 | 3.0 days | 0 |
| 1 | 180 | 2.7 days | 20 |
| 2 | 160 | 2.4 days | 40 |
| 3 | 140 | 2.1 days | 60 |
| 4 | 120 | 1.8 days | 80 |
| 5 | 100 | 1.5 days | 100 |
Each cycle also gets faster because the team builds skill and the test infrastructure matures.
For more on structuring automated tests effectively, see
Testing Fundamentals and
Functional Testing.
When Refactoring Is a Prerequisite
Sometimes you cannot automate a validation because the code is not structured for it. In these
cases, refactoring is a prerequisite step within the replacement cycle - not a separate initiative.
| Code-Level Blocker | Why It Prevents Automation | Refactoring Approach |
| --- | --- | --- |
| Tight coupling between modules | Cannot test one module without setting up the entire system | Extract interfaces at module boundaries so modules can be tested in isolation |
| Hardcoded configuration | Cannot run the same code in test and production environments | Extract configuration into environment variables or config files |
| No clear entry points | Cannot call business logic without going through the UI | Extract business logic into callable functions or services |
| Shared mutable state | Test results depend on execution order and are not repeatable | Isolate state by passing dependencies explicitly instead of using globals |
| Scattered database access | Cannot test logic without a running database and specific data | Consolidate data access behind a repository layer that can be substituted in tests |
The key discipline: refactor only the minimum needed for the specific validation you are
automating. Do not expand the refactoring scope beyond what the current cycle requires. This keeps
the refactoring small, low-risk, and tied to a concrete outcome.
For more on decoupling strategies, see
Architecture Decoupling.
The Compounding Effect
Each completed replacement cycle frees time that was previously spent on manual validation. That
freed time becomes available for the next automation cycle. The pace of migration accelerates as
you progress:
| Cycle | Manual Time per Release | Time Available for Automation | Cumulative Automated Checks |
| --- | --- | --- | --- |
| Start | 5 days | Limited (squeezed between feature work) | 0 |
| After 2 cycles | 4 days | 1 day freed | 2 validations automated |
| After 4 cycles | 3 days | 2 days freed | 4 validations automated |
| After 6 cycles | 2 days | 3 days freed | 6 validations automated |
| After 8 cycles | 1 day | 4 days freed | 8 validations automated |
Early cycles are the hardest because you have the least available time. This is why starting with
the highest-friction, lowest-effort validation matters - it frees the most time for the least
investment.
The same compounding dynamic applies to
small batches - smaller changes are easier to validate, which
makes each cycle faster, which enables even smaller changes.
Small Steps in Everything
The replacement cycle embodies the same small-batch discipline that CD itself requires. The
principle applies at every level of the migration:
- Automate one validation at a time. Do not try to build the entire pipeline in one sprint.
- Refactor one module at a time. Do not launch a “tech debt initiative” to restructure the
whole codebase before you can automate anything.
- Remove one manual check at a time. Do not announce “we are eliminating manual QA” and try
to do it all at once.
The risk of big-step migration:
- The work stalls because the scope is too large to complete alongside feature delivery.
- ROI is distant because nothing is automated until everything is automated.
- Feature delivery suffers because the team is consumed by a transformation project instead of
delivering value.
This connects directly to the brownfield migration principle:
do not stop delivering features. The replacement cycle is designed to produce value at every
iteration, not only at the end.
For more on decomposing work into small steps, see
Work Decomposition.
Measuring Progress
Track these metrics to gauge migration progress. Start collecting them from
baseline before you begin replacing validations.
| Metric | What It Tells You | Target Direction |
| --- | --- | --- |
| Manual validations remaining | How many manual steps still exist between commit and production | Down to zero |
| Time spent on manual validation per release | How much calendar time manual checks consume each release cycle | Decreasing each quarter |
| Pipeline coverage % | What percentage of validations are automated in the pipeline | Increasing toward 100% |
| Deployment frequency | How often you deploy to production | Increasing |
| Lead time for changes | Time from commit to production | Decreasing |
If manual validations remaining is decreasing but deployment frequency is not increasing, you may
be automating low-friction validations that are not on the critical path. Revisit your
prioritization and focus on the validations that are actually blocking faster delivery.
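A minimal sketch of the bookkeeping, tracking the first three metrics release over release (the release names and counts are hypothetical):

```python
# Hypothetical sketch: track replacement-cycle progress across releases.
# Release names and counts are illustrative - use your own inventory.

history = [
    # (release, manual validations remaining, automated checks in the pipeline)
    ("R1", 12, 0),
    ("R2", 10, 2),
    ("R3", 7, 5),
]

for release, manual, automated in history:
    coverage = automated / (manual + automated) * 100
    print(f"{release}: {manual} manual validations remaining, "
          f"pipeline coverage {coverage:.0f}%")
```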
Related Content
3 - Migration Path
A phased approach to adopting continuous delivery, from assessing your current state through full continuous deployment.
The Migration Path is a structured, phased journey from wherever you are today to continuous
deployment. Each phase builds on the previous one, so work through them in order.
The Phases
| Phase | Focus | Key Question |
| --- | --- | --- |
| 0 - Assess | Understand your current state | How far are we from CD? |
| 1 - Foundations | Daily integration, testing, small batches | Can we integrate safely every day? |
| 2 - Pipeline | Automated path from commit to production | Can we deploy any commit automatically? |
| 3 - Optimize | Reduce batch size, limit WIP, measure | Can we deliver small changes quickly? |
| 4 - Deliver on Demand | Deploy any change when the business needs it | Can we deliver any change to production when needed? |
Where to Start
If you are unsure where to begin, start with Phase 0: Assess to understand your
current state and identify the constraints holding you back.
3.1 - Phase 0: Assess
Understand where you are today. Map your delivery process, measure what matters, and identify the constraints holding you back.
Key question: “How far are we from CD?”
Before changing anything, you need to understand your current state. This phase helps you
create a clear picture of your delivery process, establish baseline metrics, and identify
the constraints that will guide your improvement roadmap.
What You’ll Do
- Map your value stream - Visualize the flow from idea to production
- Establish baseline metrics - Measure your current delivery performance
- Identify constraints - Find the bottlenecks limiting your flow
- Complete the current-state checklist - Self-assess against MinimumCD practices
Why This Phase Matters
Teams that skip assessment often invest in the wrong improvements. A team with a 3-week manual
testing cycle doesn’t need better deployment automation first - they need testing fundamentals.
Understanding your constraints ensures you invest effort where it will have the biggest impact.
When You’re Ready to Move On
You’re ready for Phase 1: Foundations when you can answer:
- What does our value stream look like end-to-end?
- What are our current lead time, deployment frequency, and change failure rate?
- What are the top 3 constraints limiting our delivery flow?
- Which MinimumCD practices are we missing?
3.1.1 - Value Stream Mapping
Visualize your delivery process end-to-end to identify waste and constraints before starting your CD migration.
Phase 0 - Assess | Adapted from Dojo Consortium
Before you change anything about how your team delivers software, you need to see how it works
today. Value Stream Mapping (VSM) is the single most effective tool for making your delivery
process visible. It reveals the waiting, the rework, and the handoffs that you have learned to
live with but that are silently destroying your flow.
In the context of a CD migration, a value stream map is not an academic exercise. It is the
foundation for every decision you will make in the phases ahead. It tells you where your time
goes, where quality breaks down, and which constraint to attack first.
What Is a Value Stream Map?
A value stream map is a visual representation of every step required to deliver a change from
request to production. For each step, you capture:
- Process time - the time someone is actively working on that step
- Wait time - the time the work sits idle between steps (in a queue, awaiting approval, blocked on an environment)
- Percent Complete and Accurate (%C/A) - the percentage of work arriving at this step that is usable without rework
The ratio of process time to total time (process time + wait time) is your flow efficiency.
Most teams are shocked to discover that their flow efficiency is below 15%, meaning that for
every hour of actual work, there are nearly six hours of waiting.
Prerequisites
Before running a value stream mapping session, make sure you have:
- An established, repeatable process. You are mapping what actually happens, not what should
happen. If every change follows a different path, start by agreeing on the current “most common”
path.
- All stakeholders in the room. You need representatives from every group involved in delivery:
developers, testers, operations, security, product, change management. Each person knows the
wait times and rework loops in their part of the stream that others cannot see.
- A shared understanding of wait time vs. process time. Wait time is when work sits idle. Process
time is when someone is actively working. A code review that takes “two days” but involves 30
minutes of actual review has 30 minutes of process time and roughly 15.5 working hours of wait time.
Choose Your Mapping Approach
Value stream maps can be built from two directions. Most organizations benefit from starting
bottom-up and then combining into a top-down view, but the right choice depends on where your
delivery pain is concentrated.
Bottom-Up: Map at the Team Level First
Each delivery team maps its own process independently - from the moment a developer is ready to
push a change to the moment that change is running in production. This is the approach described
in Document Your Current Process, elevated to a
formal value stream map with measured process times, wait times, and %C/A.
When to use bottom-up:
- You have multiple teams that each own their own deployment process (or think they do).
- Teams have different pain points and different levels of CD maturity.
- You want each team to own its improvement work rather than waiting for an organizational
initiative.
How it works:
- Each team maps its own value stream using the session format described below.
- Teams identify and fix their own constraints. Many constraints are local - flaky tests,
manual deployment steps, slow code review - and do not require cross-team coordination.
- After teams have mapped and improved their own streams, combine the maps to reveal
cross-team dependencies. Lay the team-level maps side by side and draw the connections:
shared environments, shared libraries, shared approval processes, upstream/downstream
dependencies.
The combined view often reveals constraints that no single team can see: a shared staging
environment that serializes deployments across five teams, a security review team that is
the bottleneck for every release, or a shared library with a release cycle that blocks
downstream teams for weeks.
Advantages: Fast to start, builds team ownership, surfaces team-specific friction that
a high-level map would miss. Teams see results quickly, which builds momentum for the
harder cross-team work.
Top-Down: Map Across Dependent Teams
Start with the full flow from a customer request (or business initiative) entering the system
to the delivered outcome in production, mapping across every team the work touches. This
produces a single map that shows the end-to-end flow including all inter-team handoffs,
shared queues, and organizational boundaries.
When to use top-down:
- Delivery pain is concentrated at the boundaries between teams, not within them.
- A single change routinely touches multiple teams (front-end, back-end, platform,
data, etc.) and the coordination overhead dominates cycle time.
- Leadership needs a full picture of organizational delivery performance to prioritize
investment.
How it works:
- Identify a representative value stream - a type of work that flows through the teams
you want to map. For example: “a customer-facing feature that requires API changes,
a front-end update, and a database migration.”
- Get representatives from every team in the room. Each person maps their team’s portion
of the flow, including the handoff to the next team.
- Connect the segments. The gaps between teams - where work queues, waits for
prioritization, or gets lost in a ticket system - are usually the largest sources of
delay.
Advantages: Reveals organizational constraints that team-level maps cannot see.
Shows the true end-to-end lead time including inter-team wait times. Essential for
changes that require coordinated delivery across multiple teams.
Combining Both Approaches
The most effective strategy for large organizations:
- Start bottom-up. Have each team document its current process
and then run its own value stream mapping session. Fix team-level quick wins immediately.
- Combine into a top-down view. Once team-level maps exist, connect them to see the
full organizational flow. The team-level detail makes the top-down map more accurate
because each segment was mapped by the people who actually do the work.
- Fix constraints at the right level. Team-level constraints (flaky tests, manual
deploys) are fixed by the team. Cross-team constraints (shared environments, approval
bottlenecks, dependency coordination) are fixed at the organizational level.
This layered approach prevents two common failure modes: mapping at too high a level (which
misses team-specific friction) and mapping only at the team level (which misses the
organizational constraints that dominate end-to-end lead time).
How to Run the Session
Step 1: Start From Delivery, Work Backward
Begin at the right side of your map - the moment a change reaches production. Then work backward
through every step until you reach the point where a request enters the system. This prevents teams
from getting bogged down in the early stages and never reaching the deployment process, which is
often where the largest delays hide.
Typical steps you will uncover include:
- Request intake and prioritization
- Story refinement and estimation
- Development (coding)
- Code review
- Build and unit tests
- Integration testing
- Manual QA / regression testing
- Security review
- Staging deployment
- User acceptance testing (UAT)
- Change advisory board (CAB) approval
- Production deployment
- Production verification
Step 2: Capture Process Time and Wait Time for Each Step
For each step on the map, record the process time and the wait time. Use averages if exact numbers
are not available, but prefer real data from your issue tracker, CI system, or deployment logs
when you can get it.
Migration Tip
Pay close attention to these migration-critical delays:
- Handoffs that block flow - Every time work passes from one team or role to another (dev to QA,
QA to ops, ops to security), there is a queue. Count the handoffs. Each one is a candidate for
elimination or automation.
- Manual gates - CAB approvals, manual regression testing, sign-off meetings. These often add
days of wait time for minutes of actual value.
- Environment provisioning delays - If developers wait hours or days for a test environment,
that is a constraint you will need to address in Phase 2.
- Rework loops - Any step where work frequently bounces back to a previous step. Track the
percentage of times this happens. These loops are destroying your cycle time.
Step 3: Calculate %C/A at Each Step
Percent Complete and Accurate measures the quality of the handoff. Ask each person: “What
percentage of the work you receive from the previous step is usable without needing clarification,
correction, or rework?”
A low %C/A at a step means the upstream step is producing defective output. This is critical
information for your migration plan because it tells you where quality needs to be built in
rather than inspected after the fact.
Step 4: Identify Constraints (Kaizen Bursts)
Mark the steps with the largest wait times and the lowest %C/A with a “kaizen burst” - a starburst
symbol indicating an improvement opportunity. These are your constraints. They will become the
focus of your migration roadmap.
Common constraints teams discover during their first value stream map:
| Constraint | Typical Impact | Migration Phase to Address |
| --- | --- | --- |
| Long-lived feature branches | Days of integration delay, merge conflicts | Phase 1 (Trunk-Based Development) |
| Manual regression testing | Days to weeks of wait time | Phase 1 (Testing Fundamentals) |
| Environment provisioning | Hours to days of wait time | Phase 2 (Production-Like Environments) |
| CAB / change approval boards | Days of wait time per deployment | Phase 2 (Pipeline Architecture) |
| Manual deployment process | Hours of process time, high error rate | Phase 2 (Single Path to Production) |
| Large batch releases | Weeks of accumulation, high failure rate | Phase 3 (Small Batches) |
Reading the Results
Once your map is complete, calculate these summary numbers:
- Total lead time = sum of all process times + all wait times
- Total process time = sum of just the process times
- Flow efficiency = total process time / total lead time * 100
- Number of handoffs = count of transitions between different teams or roles
- Rework percentage = percentage of changes that loop back to a previous step
These numbers become part of your baseline metrics and feed directly into
your work to identify constraints.
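If you want to sanity-check the arithmetic, here is a minimal sketch of these summary calculations; the step names, times (in hours), and handoffs are hypothetical.

```python
# Hypothetical sketch: summary numbers from a value stream map.
# Step names, times (in hours), and handoffs are illustrative.

steps = [
    # (step, process_hours, wait_hours, handoff_to_another_team_or_role)
    ("Code review",            0.5, 16.0, True),
    ("Build and unit tests",   0.25, 0.1, False),
    ("Manual regression",     16.0, 24.0, True),
    ("CAB approval",           0.5, 40.0, True),
    ("Production deployment",  1.0,  8.0, True),
]

total_process = sum(process for _, process, _, _ in steps)
total_wait    = sum(wait for _, _, wait, _ in steps)
total_lead    = total_process + total_wait
flow_efficiency = total_process / total_lead * 100
handoffs = sum(1 for _, _, _, handoff in steps if handoff)

print(f"Total lead time:    {total_lead:.1f} hours")
print(f"Total process time: {total_process:.1f} hours")
print(f"Flow efficiency:    {flow_efficiency:.0f}%")
print(f"Handoffs:           {handoffs}")
```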
What Good Looks Like
You are not aiming for a perfect value stream map. You are aiming for a shared, honest picture of
reality that the whole team agrees on. The map should be:
- Visible - posted on a wall or in a shared digital tool where the team sees it daily
- Honest - reflecting what actually happens, including the workarounds and shortcuts
- Actionable - with constraints clearly marked so the team knows where to focus
You will revisit and update this map as you progress through each migration phase. It is a living
document, not a one-time exercise.
Next Step
With your value stream map in hand, proceed to Baseline Metrics to
quantify your current delivery performance.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
3.1.2 - Baseline Metrics
Establish baseline measurements for your current delivery performance before making any changes.
Phase 0 - Assess | Adapted from Dojo Consortium
You cannot improve what you have not measured. Before making any changes to your delivery process,
you need to capture baseline measurements of your current performance. These baselines serve two
purposes: they help you identify where to focus your migration effort, and they give you an
honest “before” picture so you can demonstrate progress as you improve.
This is not about building a sophisticated metrics dashboard. It is about getting four numbers
written down so you have a starting point.
Why Measure Before Changing
Teams that skip baseline measurement fall into predictable traps:
- They cannot prove improvement. Six months into a migration, leadership asks “What has gotten
better?” Without a baseline, the answer is a shrug and a feeling.
- They optimize the wrong thing. Without data, teams default to fixing what is most visible or
most annoying rather than what is the actual constraint.
- They cannot detect regression. A change that feels like an improvement may actually make
things worse in ways that are not immediately obvious.
Baselines do not need to be precise to the minute. A rough but honest measurement is vastly more
useful than no measurement at all.
The Four Essential Metrics
The DORA research program (now part of Google Cloud) identified four key metrics that predict
software delivery performance and organizational outcomes. These are the metrics you should
baseline first.
1. Deployment Frequency
What it measures: How often your team deploys to production.
How to capture it: Count the number of production deployments in the last 30 days. Check your
deployment logs, CI/CD system, or change management records. If deployments are rare enough that
you remember each one, count from memory.
What it tells you:
| Frequency | What It Suggests |
| --- | --- |
| Multiple times per day | You may already be practicing continuous delivery |
| Once per week | You have a regular cadence but likely batch changes |
| Once per month or less | Large batches, high risk per deployment, likely manual process |
| Varies wildly | No consistent process; deployments are event-driven |
Record your number: ______ deployments in the last 30 days.
2. Lead Time for Changes
What it measures: The elapsed time from when code is committed to when it is running in
production.
How to capture it: Pick your last 5-10 production deployments. For each one, find the commit
timestamp of the oldest change included in that deployment and subtract it from the deployment
timestamp. Take the median.
If your team uses feature branches, the clock starts at the first commit on the branch, not when
the branch is merged. This captures the true elapsed time the change spent in the system.
What it tells you:
| Lead Time | What It Suggests |
| --- | --- |
| Less than 1 hour | Fast flow, likely small batches and good automation |
| 1 day to 1 week | Reasonable but with room for improvement |
| 1 week to 1 month | Significant queuing, likely large batches or manual gates |
| More than 1 month | Major constraints in testing, approval, or deployment |
Record your number: ______ median lead time for changes.
3. Change Failure Rate
What it measures: The percentage of deployments to production that result in a degraded
service requiring remediation (rollback, hotfix, patch, or incident).
How to capture it: Look at your last 20-30 production deployments. Count how many caused an
incident, required a rollback, or needed an immediate hotfix. Divide by the total number of
deployments.
What it tells you:
| Failure Rate | What It Suggests |
| --- | --- |
| 0-5% | Strong quality practices and small change sets |
| 5-15% | Typical for teams with some automation |
| 15-30% | Quality gaps, likely insufficient testing or large batches |
| Above 30% | Systemic quality problems; changes are frequently broken |
Record your number: ______ % of deployments that required remediation.
4. Mean Time to Restore (MTTR)
What it measures: How long it takes to restore service after a production failure caused by a
deployment.
How to capture it: Look at your production incidents from the last 3-6 months. For each
incident caused by a deployment, note the time from detection to resolution. Take the median.
If you have not had any deployment-caused incidents, note that - it either means your quality
is excellent or your deployment frequency is so low that you have insufficient data.
What it tells you:
| MTTR | What It Suggests |
| --- | --- |
| Less than 1 hour | Good incident response, likely automated rollback |
| 1-4 hours | Manual but practiced recovery process |
| 4-24 hours | Significant manual intervention required |
| More than 1 day | Serious gaps in observability or rollback capability |
Record your number: ______ median time to restore service.
Capturing Your Baselines
You do not need specialized tooling to capture these four numbers. Here is a practical approach:
- Check your CI/CD system. Most CI/CD tools (Jenkins, GitHub Actions, GitLab CI, Azure
DevOps) have deployment history. Export the last 30-90 days of deployment records.
- Check your incident tracker. Pull incidents from the last 3-6 months and filter for
deployment-caused issues.
- Check your version control. Git log data combined with deployment timestamps gives you
lead time.
- Ask the team. If data is scarce, have a conversation with the team. Experienced team
members can provide reasonable estimates for all four metrics.
Record these numbers somewhere the whole team can see them. A wiki page, a whiteboard, a shared
document - the format does not matter. What matters is that they are written down and dated.
What About Automation?
If you already have a CI/CD system that tracks deployments, you can extract most of these numbers
programmatically. But do not let the pursuit of automation delay your baseline. A spreadsheet
with manually gathered numbers is perfectly adequate for Phase 0. You will build more
sophisticated measurement into your pipeline in Phase 2.
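If you do go the programmatic route later, the extraction does not need to be elaborate. A minimal sketch, assuming you have exported each deployment with its timestamp, the oldest commit it contained, and whether it needed remediation (the field names and data are hypothetical):

```python
# Hypothetical sketch: baseline numbers from exported deployment records.
# Field names and data are illustrative - adapt to what your CI/CD tool exports.
from datetime import datetime
from statistics import median

deployments = [
    # oldest commit included, deployment time, needed rollback/hotfix?
    {"oldest_commit": datetime(2024, 5, 1, 9, 0),   "deployed": datetime(2024, 5, 7, 14, 0),  "failed": False},
    {"oldest_commit": datetime(2024, 5, 6, 11, 0),  "deployed": datetime(2024, 5, 14, 15, 0), "failed": True},
    {"oldest_commit": datetime(2024, 5, 13, 10, 0), "deployed": datetime(2024, 5, 21, 13, 0), "failed": False},
]

lead_times = [d["deployed"] - d["oldest_commit"] for d in deployments]
failure_rate = sum(d["failed"] for d in deployments) / len(deployments)

print("Deployments in window:", len(deployments))
print("Median lead time:     ", median(lead_times))
print("Change failure rate:  ", f"{failure_rate:.0%}")
```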
What Your Baselines Tell You About Where to Focus
Your baseline metrics point toward specific constraints:
| Signal | Likely Constraint | Where to Look |
| --- | --- | --- |
| Low deployment frequency + high lead time | Large batches, manual process | Value Stream Map for queue times |
| High change failure rate | Insufficient testing, poor quality practices | Testing Fundamentals |
| High MTTR | No rollback capability, poor observability | Rollback |
| High lead time + low change failure rate | Excessive manual gates adding delay but not value | Identify Constraints |
Use these signals alongside your value stream map to identify your top constraints.
A Warning About Metrics
Goodhart's Law
“When a measure becomes a target, it ceases to be a good measure.”
These metrics are diagnostic tools, not performance targets. The moment you use them to compare
teams, rank individuals, or set mandated targets, people will optimize for the metric rather
than for actual delivery improvement. A team can trivially improve their deployment frequency
number by deploying empty changes, or reduce their change failure rate by never deploying anything
risky.
Use these metrics within the team, for the team. Share trends with leadership if needed, but
never publish team-level metrics as a leaderboard. The goal is to help each team understand
their own delivery health, not to create competition.
Next Step
With your baselines recorded, proceed to Identify Constraints to
determine which bottleneck to address first.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
3.1.3 - Identify Constraints
Use your value stream map and baseline metrics to find the bottlenecks that limit your delivery flow.
Your value stream map shows you where time goes. Your
baseline metrics tell you how fast and how safely you deliver. Now you
need to answer the most important question in your migration: What is the one thing most
limiting your delivery flow right now?
This is not a question you answer by committee vote or gut feeling. It is a question you answer
with the data you have already collected.
The Theory of Constraints
Eliyahu Goldratt’s Theory of Constraints offers a simple and powerful insight: every system has
exactly one constraint that limits its overall throughput. Improving anything other than that
constraint does not improve the system.
Consider a delivery process where code review takes 30 minutes but the queue to get a review
takes 2 days, and manual regression testing takes 5 days after that. If you invest three months
building a faster build pipeline that saves 10 minutes per build, you have improved something
that is not the constraint. The 5-day regression testing cycle still dominates your lead time.
You have made a non-bottleneck more efficient, which changes nothing about how fast you deliver.
The implication for your CD migration is direct: you must find and address constraints in order
of impact. Fix the biggest one first. Then find the next one. Then fix that. This is how you
make sustained, measurable progress rather than spreading effort across improvements that do not
move the needle.
Common Constraint Categories
Software delivery constraints tend to cluster into a few recurring categories. As you review your
value stream map, look for these patterns.
Testing Bottlenecks
Symptoms: Large wait time between “code complete” and “verified.” Manual regression test
cycles measured in days or weeks. Low %C/A at the testing step, indicating frequent rework.
High change failure rate in your baseline metrics despite significant testing effort.
What is happening: Testing is being done as a phase after development rather than as a
continuous activity during development. Manual test suites have grown to cover every scenario
ever encountered, and running them takes longer with every release. The test environment is
shared and frequently broken.
Migration path: Phase 1 - Testing Fundamentals
Deployment Gates
Symptoms: Wait times of days or weeks between “tested” and “deployed.” Change Advisory Board
(CAB) meetings that happen weekly or biweekly. Multiple sign-offs required from people who are
not involved in the actual change.
What is happening: The organization has substituted process for confidence. Because
deployments have historically been risky (large batches, manual processes, poor rollback), layers
of approval have been added. These approvals add delay but rarely catch issues that automated
testing would not. They exist because the deployment process is not trustworthy, and they
persist because removing them feels dangerous.
Migration path: Phase 2 - Pipeline Architecture and
building the automated quality evidence that makes manual approvals unnecessary.
Environment Provisioning
Symptoms: Developers waiting hours or days for a test or staging environment. “Works on my
machine” failures when code reaches a shared environment. Environments that drift from production
configuration over time.
What is happening: Environments are manually provisioned, shared across teams, and treated as
pets rather than cattle. There is no automated way to create a production-like environment on
demand. Teams queue for shared environments, and environment configuration has diverged from
production.
Migration path: Phase 2 - Production-Like Environments
Code Review Delays
Symptoms: Pull requests sitting open for more than a day. Review queues with 5 or more
pending reviews. Developers context-switching because they are blocked waiting for review.
What is happening: Code review is being treated as an asynchronous handoff rather than a
collaborative activity. Reviews happen when the reviewer “gets to it” rather than as a
near-immediate response. Large pull requests make review daunting, which increases queue time
further.
Migration path: Phase 1 - Code Review and
Trunk-Based Development to reduce branch lifetime
and review size.
Manual Handoffs
Symptoms: Multiple steps in your value stream map where work transitions from one team to
another. Tickets being reassigned across teams. “Throwing it over the wall” language in how people
describe the process.
What is happening: Delivery is organized as a sequence of specialist stages (dev, test, ops,
security) rather than as a cross-functional flow. Each handoff introduces a queue, a context
loss, and a communication overhead. The more handoffs, the longer the lead time and the more
likely that information is lost.
Migration path: This is an organizational constraint, not a technical one. It is addressed
gradually through cross-functional team formation and by automating the specialist activities
into the pipeline so that handoffs become automated checks rather than manual transfers.
Using Your Value Stream Map to Find the Constraint
Pull out your value stream map and follow this process:
Step 1: Rank Steps by Wait Time
List every step in your value stream and sort them by wait time, longest first. Your biggest
constraint is almost certainly in the top three. Wait time is more important than process time
because wait time is pure waste - nothing is happening, no value is being created.
Step 2: Look for Rework Loops
Identify steps where work frequently loops back. A testing step with a 40% rework rate means
that nearly half of all changes go through the development-to-test cycle twice. The effective
wait time for that step is nearly doubled when you account for rework.
Step 3: Count Handoffs
Each handoff between teams or roles is a queue point. If your value stream has 8 handoffs, you
have 8 places where work waits. Look for handoffs that could be eliminated by automation or
by reorganizing work within the team.
Step 4: Cross-Reference with Metrics
Check your findings against your baseline metrics:
- High lead time with low process time = the constraint is in the queues (wait time), not in
the work itself
- High change failure rate = the constraint is in quality practices, not in speed
- Low deployment frequency with everything else reasonable = the constraint is in the
deployment process itself or in organizational policy
Prioritizing: Fix the Biggest One First
One Constraint at a Time
Resist the temptation to tackle multiple constraints simultaneously. The Theory of Constraints
is clear: improving a non-bottleneck does not improve the system. Identify the single biggest
constraint, focus your migration effort there, and only move to the next constraint when the
first one is no longer the bottleneck.
This does not mean the entire team works on one thing. It means your improvement initiatives
are sequenced to address constraints in order of impact.
Once you have identified your top constraint, map it to a migration phase using the constraint categories above - each one names the phase that addresses it.
The Next Constraint
Fixing your first constraint will improve your flow. It will also reveal the next constraint.
This is expected and healthy. A delivery process is a chain, and strengthening the weakest link
means a different link becomes the weakest.
This is why the migration is organized in phases. Phase 1 addresses the foundational constraints
that nearly every team has (integration practices, testing, small work). Phase 2 addresses
pipeline constraints. Phase 3 optimizes flow. You will cycle through constraint identification
and resolution throughout your migration.
Plan to revisit your value stream map and metrics after addressing each major constraint. Your
map from today will be outdated within weeks of starting your migration - and that is a sign of
progress.
Next Step
Complete the Current State Checklist to assess your team against
specific MinimumCD practices and confirm your migration starting point.
3.1.4 - Current State Checklist
Self-assess your team against MinimumCD practices to understand your starting point and determine where to begin your migration.
This checklist translates the practices defined by MinimumCD.org into
concrete yes-or-no questions you can answer about your team today. It is not a test to pass. It is
a diagnostic tool that shows you which practices are already in place and which ones your migration
needs to establish.
Work through each category with your team. Be honest - checking a box you have not earned gives
you a migration plan that skips steps you actually need.
How to Use This Checklist
For each item, mark it with an [x] if your team consistently does this today - not occasionally,
not aspirationally, but as a default practice. If you do it sometimes but not reliably, leave it
unchecked.
Trunk-Based Development
Why it matters: Long-lived branches are the single biggest source of integration risk. Every
hour a branch lives is an hour where it diverges from what everyone else is doing. Trunk-based
development eliminates integration as a separate, painful event and makes it a continuous,
trivial activity. Without this practice, continuous integration is impossible, and without
continuous integration, continuous delivery is impossible.
Continuous Integration
Why it matters: Continuous integration means that the team always knows whether the codebase
is in a working state. If builds are not automated, if tests do not run on every commit, or if
broken builds are tolerated, then the team is flying blind. Every change is a gamble that
something else has not broken in the meantime.
Pipeline Practices
Why it matters: A pipeline is the mechanism that turns code changes into production
deployments. If the pipeline is inconsistent, manual, or bypassable, then you do not have a
reliable path to production. You have a collection of scripts and hopes. Deterministic, automated
pipelines are what make deployment a non-event rather than a high-risk ceremony.
Deployment
Why it matters: If your test environment does not look like production, your tests are lying
to you. If configuration is baked into your artifact, you are rebuilding for each environment,
which means the thing you tested is not the thing you deploy. If you cannot roll back quickly,
every deployment is a high-stakes bet. These practices ensure that what you test is what you
ship, and that shipping is safe.
Quality
Why it matters: Quality that depends on manual inspection does not scale and does not speed
up. As your deployment frequency increases through the migration, manual quality gates become
the bottleneck. The goal is to build quality in through automation so that a green build means
a deployable build. This is the foundation of continuous delivery: if it passes the pipeline,
it is ready for production.
Scoring Guide
Count the number of items you checked across all categories.
| Score | Your Starting Point | Recommended Phase |
| --- | --- | --- |
| 0-5 | You are early in your journey. Most foundational practices are not yet in place. | Start at the beginning of Phase 1 - Foundations. Focus on trunk-based development and basic test automation first. |
| 6-12 | You have some practices in place but significant gaps remain. This is the most common starting point. | Start with Phase 1 - Foundations but focus on the categories where you had the fewest checks. Your constraint analysis will tell you which gap to close first. |
| 13-18 | Your foundations are solid. The gaps are likely in pipeline automation and deployment practices. | You may be able to move quickly through Phase 1 and focus your effort on Phase 2 - Pipeline. Validate with your value stream map that your remaining constraints match. |
| 19-22 | You are well-practiced in most areas. Your migration is about closing specific gaps and optimizing flow. | Review your unchecked items - they point to specific topics in Phase 3 - Optimize or Phase 4 - Deliver on Demand. |
| 23-25 | You are already practicing most of what MinimumCD defines. Your focus should be on consistency and delivering on demand. | Jump to Phase 4 - Deliver on Demand and focus on the capability to deploy any change when needed. |
A Score Is Not a Grade
This checklist exists to help your team find its starting point, not to judge your team’s
competence. A score of 5 does not mean your team is failing - it means your team has a clear
picture of what to work on. A score of 22 does not mean you are done - it means your remaining
gaps are specific and targeted.
The only wrong answer is a dishonest one.
Putting It All Together
You now have four pieces of information from Phase 0:
- A value stream map showing your end-to-end delivery process with wait times and rework loops
- Baseline metrics for deployment frequency, lead time, change failure rate, and MTTR
- An identified top constraint telling you where to focus first
- This checklist confirming which practices are in place and which are missing
Together, these give you a clear, data-informed starting point for your migration. You know where
you are, you know what is slowing you down, and you know which practices to establish first.
Next Step
You are ready to begin Phase 1 - Foundations. Start with the practice area
that addresses your top constraint.
3.2 - Phase 1: Foundations
Establish the essential practices for daily integration, testing, and small work decomposition.
Key question: “Can we integrate safely every day?”
This phase establishes the development practices that make continuous delivery possible.
Without these foundations, pipeline automation just speeds up a broken process.
What You’ll Do
- Adopt trunk-based development - Integrate to trunk at least daily
- Build testing fundamentals - Create a fast, reliable test suite
- Automate your build - One command to build, test, and package
- Decompose work - Break features into small, deliverable increments
- Streamline code review - Fast, effective review that doesn’t block flow
- Establish working agreements - Shared definitions of done and ready
- Everything as code - Infrastructure, pipelines, schemas, monitoring, and security policies in version control, delivered through pipelines
Why This Phase Matters
These practices are the prerequisites for everything that follows. Trunk-based development
eliminates merge hell. Testing fundamentals give you the confidence to deploy frequently.
Small work decomposition reduces risk per change. Together, they create the feedback loops
that drive continuous improvement.
When You’re Ready to Move On
You’re ready for Phase 2: Pipeline when:
- All developers integrate to trunk at least once per day
- Your test suite catches real defects and runs in under 10 minutes
- You can build and package your application with a single command
- Most work items are completable within 2 days
3.2.1 - Trunk-Based Development
Integrate all work to the trunk at least once per day to enable continuous integration.
Phase 1 - Foundations | Adapted from MinimumCD.org
Trunk-based development is the first foundation to establish. Without daily integration to a shared trunk, the rest of the CD migration cannot succeed. This page covers the core practice, two migration paths, and a tactical guide for getting started.
What Is Trunk-Based Development?
Trunk-based development (TBD) is a branching strategy where all developers integrate their work into a single shared branch - the trunk - at least once per day. The trunk is always kept in a releasable state.
This is a non-negotiable prerequisite for continuous delivery. If your team is not integrating to trunk daily, you are not doing CI, and you cannot do CD. There is no workaround.
“If it hurts, do it more often, and bring the pain forward.”
- Jez Humble, Continuous Delivery
What TBD Is Not
- It is not “everyone commits directly to main with no guardrails.” You still test, review, and validate work - you just do it in small increments.
- It is not incompatible with code review. It requires review to happen quickly.
- It is not reckless. It is the opposite: small, frequent integrations are far safer than large, infrequent merges.
What Trunk-Based Development Improves
| Problem | How TBD Helps |
|---|---|
| Merge conflicts | Small changes integrated frequently rarely conflict |
| Integration risk | Bugs are caught within hours, not weeks |
| Long-lived branches diverge from reality | The trunk always reflects the current state of the codebase |
| “Works on my branch” syndrome | Everyone shares the same integration point |
| Slow feedback | CI runs on every integration, giving immediate signal |
| Large batch deployments | Small changes are individually deployable |
| Fear of deployment | Each change is small enough to reason about |
Two Migration Paths
There are two valid approaches to trunk-based development. Both satisfy the minimum CD requirement of daily integration. Choose the one that fits your team’s current maturity and constraints.
Path 1: Short-Lived Branches
Developers create branches that live for less than 24 hours. Work is done on the branch, reviewed quickly, and merged to trunk within a single day.
How it works:
- Pull the latest trunk
- Create a short-lived branch
- Make small, focused changes
- Open a pull request (or use pair programming as the review)
- Merge to trunk before end of day
- The branch is deleted after merge
Best for teams that:
- Currently use long-lived feature branches and need a stepping stone
- Have regulatory requirements for traceable review records
- Use pull request workflows they want to keep (but make faster)
- Are new to TBD and want a gradual transition
Key constraint: The branch must merge to trunk within 24 hours. If it does not, you have a long-lived branch and you have lost the benefit of TBD.
Path 2: Direct Trunk Commits
Developers commit directly to trunk. Quality is ensured through pre-commit checks, pair programming, and strong automated testing.
How it works:
- Pull the latest trunk
- Make a small, tested change locally
- Run the local build and test suite
- Push directly to trunk
- CI validates the commit immediately
Best for teams that:
- Have strong automated test coverage
- Practice pair or mob programming (which provides real-time review)
- Want maximum integration frequency
- Have high trust and shared code ownership
Key constraint: This requires excellent test coverage and a culture where the team owns quality collectively. Without these, direct trunk commits become reckless.
How to Choose Your Path
Ask these questions:
- Do you have automated tests that catch real defects? If no, start with Path 1 and invest in testing fundamentals in parallel.
- Does your organization require documented review approvals? If yes, use Path 1 with rapid pull requests.
- Does your team practice pair programming? If yes, Path 2 may work immediately - pairing is a continuous review process.
- How large is your team? Teams of 2-4 can adopt Path 2 more easily. Larger teams may start with Path 1 and transition later.
Both paths are valid. The important thing is daily integration to trunk. Do not spend weeks debating which path to use. Pick one, start today, and adjust.
Essential Supporting Practices
Trunk-based development does not work in isolation. These supporting practices make daily integration safe and sustainable.
Feature Flags
When you integrate to trunk daily, incomplete features will exist on trunk. Feature flags let you merge code that is not yet ready for users.
Rules for feature flags in TBD:
- Use flags to decouple deployment from release
- Remove flags within days or weeks - they are temporary by design
- Keep flag logic simple; avoid nested or dependent flags
- Test both flag states in your automated test suite
Feature flags are covered in more depth in Phase 3: Optimize.
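To make the rule about testing both flag states concrete, here is a minimal sketch in Python (pytest). The in-memory FLAGS dictionary, the `is_enabled` helper, and the flag name are hypothetical stand-ins for whatever flag provider you use; the point is that a single parametrized test exercises both states of the flag.

```python
import pytest

FLAGS = {"new_checkout": False}  # in practice, read from your flag provider


def is_enabled(name: str) -> bool:
    return FLAGS.get(name, False)


def checkout_total(items, discount_service=None):
    total = sum(item["price"] for item in items)
    if is_enabled("new_checkout") and discount_service:
        total -= discount_service.discount_for(total)
    return total


@pytest.mark.parametrize("flag_state", [True, False])
def test_checkout_total_under_both_flag_states(flag_state, monkeypatch):
    # Exercise both flag states so incomplete work on trunk stays safe.
    monkeypatch.setitem(FLAGS, "new_checkout", flag_state)

    class StubDiscounts:
        def discount_for(self, total):
            return 5

    total = checkout_total([{"price": 50}, {"price": 50}], StubDiscounts())
    assert total == (95 if flag_state else 100)
```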
Commit Small, Commit Often
Each commit should be a small, coherent change that leaves trunk in a working state. If you are committing once a day in a large batch, you are not getting the benefit of TBD.
Guidelines:
- Each commit should be independently deployable
- A commit should represent a single logical change
- If you cannot describe the change in one sentence, it is too big
- Target multiple commits per day, not one large commit at end of day
Test-Driven Development (TDD) and ATDD
TDD provides the safety net that makes frequent integration sustainable. When every change is accompanied by tests, you can integrate confidently.
- TDD: Write the test before the code. Red, green, refactor.
- ATDD (Acceptance Test-Driven Development): Write acceptance criteria as executable tests before implementation.
Both practices ensure that your test suite grows with your code and that trunk remains releasable.
Getting Started: A Tactical Guide
Step 1: Shorten Your Branches (Week 1)
If your team currently uses long-lived feature branches, start by shortening their lifespan.
| Current State | Target |
|---|---|
| Branches live for weeks | Branches live for < 1 week |
| Merge once per sprint | Merge multiple times per week |
| Large merge conflicts are normal | Conflicts are rare and small |
Action: Set a team agreement that no branch lives longer than 2 days. Track branch age as a metric.
Step 2: Integrate Daily (Week 2-3)
Tighten the window from 2 days to 1 day.
Action:
- Every developer merges to trunk at least once per day, every day they write code
- If work is not complete, use a feature flag or other technique to merge safely
- Track integration frequency as your primary metric
Step 3: Ensure Trunk Stays Green (Week 2-3)
Daily integration is only useful if trunk remains in a releasable state.
Action:
- Run your test suite on every merge to trunk
- If the build breaks, fixing it becomes the team’s top priority
- Establish a working agreement: “broken build = stop the line” (see Working Agreements)
Step 4: Remove the Safety Net of Long Branches (Week 4+)
Once the team is integrating daily with a green trunk, eliminate the option of long-lived branches.
Action:
- Configure branch protection rules to warn or block branches older than 24 hours
- Remove any workflow that depends on long-lived branches (e.g., “dev” or “release” branches)
- Celebrate the transition - this is a significant shift in how the team works
Key Pitfalls
1. “We integrate daily, but we also keep our feature branches”
If you are merging to trunk daily but also maintaining a long-lived feature branch, you are not doing TBD. The feature branch will diverge, and merging it later will be painful. The integration to trunk must be the only integration point.
2. “Our builds are too slow for frequent integration”
If your CI pipeline takes 30 minutes, integrating multiple times a day feels impractical. This is a real constraint - address it by investing in build automation and parallelizing your test suite. Target a build time under 10 minutes.
3. “We can’t integrate incomplete features to trunk”
Yes, you can. Use feature flags to hide incomplete work from users. The code exists on trunk, but the feature is not active. This is a standard practice at every company that practices CD.
4. “Code review takes too long for daily integration”
If pull request reviews take 2 days, daily integration is impossible. The solution is to change how you review: pair programming provides continuous review, mob programming reviews in real time, and small changes can be reviewed asynchronously in minutes. See Code Review for specific techniques.
5. “What if someone pushes a bad commit to trunk?”
This is why you have automated tests, CI, and the “broken build = top priority” agreement. Bad commits will happen. The question is how fast you detect and fix them. With TBD and CI, the answer is minutes, not days.
Measuring Success
Track these metrics to verify your TBD adoption:
| Metric | Target | Why It Matters |
|---|---|---|
| Integration frequency | At least 1 per developer per day | Confirms daily integration is happening |
| Branch age | < 24 hours | Catches long-lived branches |
| Build duration | < 10 minutes | Enables frequent integration without frustration |
| Merge conflict frequency | Decreasing over time | Confirms small changes reduce conflicts |
Further Reading
This page covers the essentials for Phase 1 of your migration. For detailed guidance on specific scenarios, see the full source material:
Next Step
Once your team is integrating to trunk daily, build the test suite that makes that integration trustworthy. Continue to Testing Fundamentals.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
3.2.2 - Testing Fundamentals
Build a test architecture that gives your pipeline the confidence to deploy any change, even when dependencies outside your control are unavailable.
Phase 1 - Foundations | Adapted from Dojo Consortium
Before you can trust your pipeline, you need a test suite that is fast, deterministic, and catches
real defects. But a collection of tests is not enough. You need a test architecture - a
deliberate structure where different types of tests work together to give you the confidence to
deploy every change, regardless of whether external systems are up, slow, or behaving
unexpectedly.
Why Testing Is a Foundation
Continuous delivery requires that trunk always be releasable. The only way to know trunk is
releasable is to test it - automatically, on every change. Without a reliable test suite, daily
integration is just daily risk.
In many organizations, testing is the single biggest obstacle to CD adoption. Not because teams
lack tests, but because the tests they have are slow, flaky, poorly structured, and - most
critically - unable to give the pipeline a reliable answer to the question: is this change safe
to deploy?
Testing Goals for CD
Your test suite must meet these criteria before it can support continuous delivery:
| Goal | Target | Why |
|---|---|---|
| Fast | Full suite completes in under 10 minutes | Developers need feedback before context-switching |
| Deterministic | Same code always produces the same test result | Flaky tests destroy trust and get ignored |
| Catches real bugs | Tests fail when behavior is wrong, not when implementation changes | Brittle tests create noise, not signal |
| Independent of external systems | Pipeline can determine deployability without any dependency being available | Your ability to deploy cannot be held hostage by someone else’s outage |
If your test suite does not meet these criteria today, improving it is your highest-priority
foundation work.
Beyond the Test Pyramid
The test pyramid - many unit tests at the base, fewer integration tests in the middle, a handful
of end-to-end tests at the top - has been the dominant mental model for test strategy since Mike
Cohn introduced it. The core insight is sound: push testing as low as possible. Lower-level
tests are faster, more deterministic, and cheaper to maintain. Higher-level tests are slower,
more brittle, and more expensive.
But as a prescriptive model, the pyramid is overly simplistic. Teams that treat it as a rigid
ratio end up in unproductive debates about whether they have “too many” integration tests or “not
enough” unit tests. The shape of your test distribution matters far less than whether your tests,
taken together, give you the confidence to deploy.
What actually matters
The pyramid’s principle - write tests with different granularity - remains correct. But for
CD, the question is not “do we have the right pyramid shape?” The question is:
Can our pipeline determine that a change is safe to deploy without depending on any system we
do not control?
This reframes the testing conversation. Instead of counting tests by type and trying to match a
diagram, you design a test architecture where:
- Fast, deterministic tests catch the vast majority of defects and run on every commit. These tests use test doubles for anything outside the team’s control. They give you a reliable go/no-go signal in minutes.
- Contract tests verify that your test doubles still match reality. They run asynchronously and catch drift between your assumptions and the real world - without blocking your pipeline.
- A small number of non-deterministic tests validate that the fully integrated system works. These run post-deployment and provide monitoring, not gating.
This structure means your pipeline can confidently say “yes, deploy this” even if a downstream
API is having an outage, a third-party service is slow, or a partner team hasn’t deployed their
latest changes yet. Your ability to deliver is decoupled from the reliability of systems you do
not own.
The anti-pattern: the ice cream cone
Most teams that struggle with CD have an inverted test distribution - too many slow, expensive
end-to-end tests and too few fast, focused tests.
┌─────────────────────────┐
│ Manual Testing │ ← Most testing happens here
├─────────────────────────┤
│ End-to-End Tests │ ← Slow, flaky, expensive
├─────────────────────────┤
│ Integration Tests │ ← Some, but not enough
├───────────┤
│Unit Tests │ ← Too few
└───────────┘
The ice cream cone makes CD impossible. Manual testing gates block every release. End-to-end tests
take hours, fail randomly, and depend on external systems being healthy. The pipeline cannot give
a fast, reliable answer about deployability, so deployments become high-ceremony events.
Test Architecture for the CD Pipeline
A test architecture is the deliberate structure of how different test types work together across
your pipeline to give you deployment confidence. Each layer has a specific role, and the layers
reinforce each other.
Layer 1: Unit tests - verify logic in isolation
Unit tests exercise individual functions, methods, or components with all external dependencies
replaced by test doubles. They are the fastest and most
deterministic tests you have.
Role in CD: Catch logic errors, regressions, and edge cases instantly. Provide the tightest
feedback loop - developers should see results in seconds while coding.
What they cannot do: Verify that components work together, that your code correctly calls
external services, or that the system behaves correctly as a whole.
See Unit Tests for detailed guidance.
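For illustration, a minimal unit-test sketch in Python with pytest; the pricing function is a hypothetical example, not part of any referenced codebase.

```python
import pytest


def apply_discount(price: float, percent: float) -> float:
    # Pure logic with no external dependencies - ideal unit-test territory.
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


def test_apply_discount_reduces_price():
    assert apply_discount(200.0, 25) == 150.0


def test_apply_discount_rejects_invalid_percent():
    with pytest.raises(ValueError):
        apply_discount(100.0, 150)
```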
Layer 2: Integration tests - verify boundaries
Integration tests verify that components interact correctly at their boundaries: database queries
return the expected data, HTTP clients serialize requests correctly, message producers format
messages as expected. External systems are replaced with test doubles, but internal collaborators
are real.
Role in CD: Catch the bugs that unit tests miss - mismatched interfaces, serialization errors,
query bugs. These tests are fast enough to run on every commit but realistic enough to catch
real integration failures.
What they cannot do: Verify that the system works end-to-end from a user’s perspective, or
that your assumptions about external services are still correct.
The line between unit tests and integration tests is often debated. As Ham Vocke notes in
The Practical Test Pyramid, the naming matters less than the discipline. The key question is
whether the test is fast, deterministic, and tests something your unit tests cannot. If yes, it
belongs here.
See Integration Tests for detailed guidance.
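A sketch of what a boundary-level test can look like in Python, assuming a hypothetical repository class; the query logic is real, but the database is an in-memory SQLite instance, so no external server is needed and the test stays deterministic.

```python
import sqlite3


class UserRepository:
    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn

    def add(self, email: str) -> int:
        cur = self.conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        return cur.lastrowid

    def find_by_email(self, email: str):
        return self.conn.execute(
            "SELECT id, email FROM users WHERE email = ?", (email,)
        ).fetchone()


def test_user_repository_round_trip():
    # Real SQL, real driver, but an in-memory database - fast and repeatable.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    repo = UserRepository(conn)

    user_id = repo.add("dev@example.com")
    found = repo.find_by_email("dev@example.com")

    assert found == (user_id, "dev@example.com")
```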
Layer 3: Functional tests - verify your system works in isolation
Functional tests (also called component tests) exercise your entire sub-system - your service,
your application - from the outside, as a user or consumer would interact with it. All external
dependencies are replaced with test doubles. The test boots your application, sends real HTTP
requests or simulates real user interactions, and verifies the responses.
Role in CD: This is the layer that proves your system works as a complete unit, independent
of everything else. Functional tests answer: “if we deploy this service right now, will it
behave correctly for every interaction that is within our control?” Because all external
dependencies are stubbed, these tests are deterministic and fast. They can run on every commit.
Why this layer is critical for CD: Functional tests are what allow you to deploy with
confidence even when dependencies outside your control are unavailable. Your test doubles
simulate the expected behavior of those dependencies. As long as your doubles are accurate (which
is what contract tests verify), your functional tests prove your system handles those interactions
correctly.
See Functional Tests for detailed guidance.
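As an illustration, here is a functional-test sketch that assumes a small Flask service; the payment gateway client is a hypothetical external dependency replaced with a stub, and the test drives the whole application through its HTTP interface.

```python
from flask import Flask, jsonify


class StubPaymentGateway:
    """Test double for an external dependency outside the team's control."""

    def charge(self, amount_cents: int) -> dict:
        return {"status": "approved", "amount": amount_cents}


def create_app(gateway) -> Flask:
    app = Flask(__name__)

    @app.post("/orders")
    def create_order():
        result = gateway.charge(4999)
        return jsonify(result), 201

    return app


def test_create_order_returns_approved_charge():
    app = create_app(StubPaymentGateway())
    client = app.test_client()  # exercises routing, serialization, and handlers

    response = client.post("/orders")

    assert response.status_code == 201
    assert response.get_json()["status"] == "approved"
```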
Layer 4: Contract tests - verify your assumptions about others
Contract tests validate that the test doubles you use in layers 1-3 still accurately represent
the real external systems. They run against live dependencies and check contract format - response
structures, field names, types, and status codes - not specific data values.
Role in CD: Contract tests are the bridge between your fast, deterministic test suite and the
real world. Without them, your test doubles can silently drift from reality, and your functional
tests provide false confidence. With them, you know that the assumptions baked into your test
doubles are still correct.
Consumer-driven contracts take this further: the consumer of an API publishes expectations
(using tools like Pact), and the provider runs those expectations as part of
their build. Both teams know immediately when a change would break the contract.
Contract tests are non-deterministic because they hit live systems. They should not block
your pipeline. Instead, failures trigger a review: has the contract changed, or was it a transient
network issue? If the contract has changed, update your test doubles and re-verify.
See Contract Tests for detailed guidance.
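A minimal contract-test sketch, assuming a hypothetical inventory API that your test doubles simulate. Note that it checks structure and types, not specific values, and is meant to run on a schedule rather than as a pipeline gate.

```python
import requests

# Placeholder URL for a dependency your functional-test doubles simulate.
INVENTORY_API = "https://inventory.example.com/api/items/123"


def test_inventory_item_contract():
    response = requests.get(INVENTORY_API, timeout=10)
    assert response.status_code == 200

    body = response.json()
    # Verify the contract format - field names and types - not the data itself.
    assert isinstance(body["sku"], str)
    assert isinstance(body["quantity"], int)
    assert body["status"] in {"in_stock", "backordered", "discontinued"}
```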
Layer 5: End-to-end tests - verify the integrated system post-deployment
End-to-end tests validate complete user journeys through the fully integrated system with no
test doubles. They run against real services, real databases, and real third-party integrations.
Role in CD: E2E tests are monitoring, not gating. They run after deployment to verify that
the integrated system works. A small suite of smoke tests can run immediately post-deployment
to catch gross integration failures. Broader E2E suites run on a schedule.
Why E2E tests should not gate your pipeline: E2E tests are non-deterministic. They fail for
reasons unrelated to your change - network blips, third-party outages, shared environment
instability. If your pipeline depends on E2E tests passing before you can deploy, your deployment
frequency is limited by the reliability of every system in the chain. This is the opposite of the
independence CD requires.
See End-to-End Tests for detailed guidance.
How the layers work together
| Pipeline stage | Test layer | Deterministic? | Blocks deploy? |
|---|---|---|---|
| On every commit | Unit tests | Yes | Yes |
| On every commit | Integration tests | Yes | Yes |
| On every commit | Functional tests | Yes | Yes |
| Asynchronous | Contract tests | No | No (triggers review) |
| Post-deployment | E2E smoke tests | No | Triggers rollback if critical |
| Post-deployment | Synthetic monitoring | No | Triggers alerts |
The critical insight: everything that blocks deployment is deterministic and under your
control. Everything that involves external systems runs asynchronously or post-deployment. This
is what gives you the independence to deploy any time, regardless of the state of the world
around you.
Week 1 Action Plan
If your test suite is not yet ready to support CD, use this focused action plan to make immediate
progress.
Day 1-2: Audit your current test suite
Assess where you stand before making changes.
Actions:
- Run your full test suite 3 times. Note total duration and any tests that pass intermittently
(flaky tests).
- Count tests by type: unit, integration, functional, end-to-end.
- Identify tests that require external dependencies (databases, APIs, file systems) to run.
- Record your baseline: total test count, pass rate, duration, flaky test count.
- Map each test type to a pipeline stage. Which tests gate deployment? Which run asynchronously?
Which tests couple your deployment to external systems?
Output: A clear picture of your test distribution and the specific problems to address.
Day 2-3: Fix or remove flaky tests
Flaky tests are worse than no tests. They train developers to ignore failures, which means real
failures also get ignored.
Actions:
- Quarantine all flaky tests immediately. Move them to a separate suite that does not block the
build.
- For each quarantined test, decide: fix it (if the behavior it tests matters) or delete it (if
it does not).
- Common causes of flakiness: timing dependencies, shared mutable state, reliance on external
services, test order dependencies.
- Target: zero flaky tests in your main test suite by end of week.
Day 3-4: Decouple your pipeline from external dependencies
This is the highest-leverage change for CD. Identify every test that calls a real external service
and replace that dependency with a test double.
Actions:
- List every external service your tests depend on: databases, APIs, message queues, file
storage, third-party services.
- For each dependency, decide the right test double approach:
- In-memory fakes for databases (e.g., SQLite, H2, testcontainers with local instances).
- HTTP stubs for external APIs (e.g., WireMock, nock, MSW).
- Fakes for message queues, email services, and other infrastructure.
- Replace the dependencies in your unit, integration, and functional tests.
- Move the original tests that hit real services into a separate suite - these become your
starting contract tests or E2E smoke tests.
Output: A test suite where everything that blocks the build is deterministic and runs without
network access to external systems.
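For example, here is a sketch of swapping a live HTTP call for a test double using the standard library's unittest.mock; the exchange-rate service and the function names are hypothetical.

```python
from unittest.mock import patch

import requests


def fetch_fx_rate(currency: str) -> float:
    # Real implementation calls an external service we do not control.
    response = requests.get(f"https://fx.example.com/rates/{currency}", timeout=5)
    response.raise_for_status()
    return response.json()["rate"]


def price_in(currency: str, amount_usd: float) -> float:
    return round(amount_usd * fetch_fx_rate(currency), 2)


@patch(__name__ + ".fetch_fx_rate", return_value=0.92)
def test_price_in_converts_without_network(mock_rate):
    # The gating test runs with no network access; the real call moves to a
    # contract test or E2E smoke test.
    assert price_in("EUR", 100.0) == 92.0
    mock_rate.assert_called_once_with("EUR")
```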
Day 4-5: Add functional tests for critical paths
If you don’t have functional tests (component tests) that exercise your whole service in
isolation, start with the most critical paths.
Actions:
- Identify the 3-5 most critical user journeys or API endpoints in your application.
- Write a functional test for each: boot the application, stub external dependencies, send a
real request or simulate a real user action, verify the response.
- Each functional test should prove that the feature works correctly assuming external
dependencies behave as expected (which your test doubles encode).
- Run these in CI on every commit.
Day 5: Set up contract tests for your most important dependency
Pick the external dependency that changes most frequently or has caused the most production
issues. Set up a contract test for it.
Actions:
- Write a contract test that validates the response structure (types, required fields, status
codes) of the dependency’s API.
- Run it on a schedule (e.g., every hour or daily), not on every commit.
- When it fails, update your test doubles to match the new reality and re-verify your
functional tests.
- If the dependency is owned by another team in your organization, explore consumer-driven
contracts with a tool like Pact.
Test-Driven Development (TDD)
TDD is the practice of writing the test before the code. It is the most effective way to build a
reliable test suite because it ensures every piece of behavior has a corresponding test.
The TDD cycle:
- Red: Write a failing test that describes the behavior you want.
- Green: Write the minimum code to make the test pass.
- Refactor: Improve the code without changing the behavior. The test ensures you do not
break anything.
Why TDD supports CD:
- Every change is automatically covered by a test
- The test suite grows proportionally with the codebase
- Tests describe behavior, not implementation, making them more resilient to refactoring
- Developers get immediate feedback on whether their change works
TDD is not mandatory for CD, but teams that practice TDD consistently have significantly faster
and more reliable test suites.
Getting started with TDD
If your team is new to TDD, start small:
- Pick one new feature or bug fix this week.
- Write the test first, watch it fail.
- Write the code to make it pass.
- Refactor.
- Repeat for the next change.
Do not try to retroactively TDD your entire codebase. Apply TDD to new code and to any code you
modify.
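A compressed red-green sketch of what that first TDD change might look like in Python; the shopping-cart example is hypothetical.

```python
# Step 1 (red): write the test first and watch it fail against the old code,
# which crashed on an empty cart.
def test_cart_total_is_zero_for_empty_cart():
    assert cart_total([]) == 0


# Step 2 (green): write the minimum implementation that makes the test pass.
def cart_total(items):
    return sum(item["price"] for item in items)


# Step 3 (refactor): improve names, types, or structure with the test as a
# safety net, re-running the suite after each change.
```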
Testing Matrix
Use this reference to decide what type of test to write and where it runs in your pipeline.
| What You Need to Verify | Test Type | Speed | Deterministic? | Blocks Deploy? | See Also |
|---|---|---|---|---|---|
| A function or method behaves correctly | Unit | Milliseconds | Yes | Yes | Unit Tests |
| Components interact correctly at a boundary | Integration | Milliseconds to seconds | Yes | Yes | Integration Tests |
| Your whole service works in isolation | Functional | Seconds | Yes | Yes | Functional Tests |
| Your test doubles match reality | Contract | Seconds | No | No | Contract Tests |
| A critical user journey works end-to-end | E2E | Minutes | No | No | End-to-End Tests |
| Code quality, security, and style compliance | Static Analysis | Seconds | Yes | Yes | |
Best Practices Summary
Do
- Run tests on every commit. If tests do not run automatically, they will be skipped.
- Keep the deterministic suite under 10 minutes. If it is slower, developers will stop
running it locally.
- Fix broken tests immediately. A broken test is equivalent to a broken build.
- Delete tests that do not provide value. A test that never fails and tests trivial behavior
is maintenance cost with no benefit.
- Test behavior, not implementation. Tests should verify what the code does, not how it
does it. As Ham Vocke advises: “if I enter values x and y, will the result be z?” - not the
sequence of internal calls that produce z.
- Use test doubles for external dependencies. Your deterministic tests should run without
network access to external systems.
- Validate test doubles with contract tests. Test doubles that drift from reality give false
confidence.
- Treat test code as production code. Give it the same care, review, and refactoring
attention.
Do Not
- Do not tolerate flaky tests. Quarantine or delete them immediately.
- Do not gate your pipeline on non-deterministic tests. E2E and contract test failures
should trigger review or alerts, not block deployment.
- Do not couple your deployment to external system availability. If a third-party API being
down prevents you from deploying, your test architecture has a critical gap.
- Do not write tests after the fact as a checkbox exercise. Tests written without
understanding the behavior they verify add noise, not value.
- Do not test private methods directly. Test the public interface; private methods are tested
indirectly.
- Do not share mutable state between tests. Each test should set up and tear down its own
state.
- Do not use sleep/wait for timing-dependent tests. Use explicit waits, polling, or
event-driven assertions (see the polling sketch after this list).
- Do not require a running database or external service for unit tests. That makes them
integration tests - which is fine, but categorize them correctly.
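Here is one way to replace sleep-based waits with polling, as a sketch; the `wait_until` helper and the fake queue are ours, not a library API.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll a condition until it is true or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    raise AssertionError(f"condition not met within {timeout} seconds")


def test_background_job_completes():
    class FakeQueue:
        calls = 0

        def is_empty(self):
            self.calls += 1
            return self.calls > 3  # becomes true after a few polls

    queue = FakeQueue()
    wait_until(queue.is_empty)  # fails fast with a clear message on timeout
```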
Using Tests to Find and Eliminate Defect Sources
A test suite that catches bugs is good. A test suite that helps you stop producing those bugs
is transformational. Every test failure is evidence of a defect, and every defect has a source. If
you treat test failures only as things to fix, you are doing rework. If you treat them as
diagnostic data about where your process breaks down, you can make systemic changes that prevent
entire categories of defects from occurring.
This is the difference between a team that writes more tests to catch more bugs and a team that
changes how it works so that fewer bugs are created in the first place.
Trace every defect to its origin
When a test catches a defect - or worse, when a defect escapes to production - ask: where was
this defect introduced, and what would have prevented it from being created?
Defects do not originate randomly. They cluster around specific causes, and each cause has a
systemic fix:
| Where Defects Originate | Example Defects | Detection Method | Systemic Fix |
|---|---|---|---|
| Requirements | Building the right thing wrong, or the wrong thing right | UX analytics, task completion tracking, A/B testing | Acceptance criteria as user outcomes, not implementation tasks. Three Amigos sessions before work starts. Example mapping to surface edge cases before coding begins. |
| Missing domain knowledge | Business rules encoded incorrectly, implicit assumptions | Magic number detection, knowledge-concentration metrics | Embed domain rules in code using ubiquitous language (DDD). Pair programming to spread knowledge. Living documentation generated from code. |
| Integration boundaries | Interface mismatches, wrong assumptions about upstream behavior | Consumer-driven contract tests, schema validation | Contract tests mandatory per boundary. API-first design. Document behavioral contracts, not just data schemas. |
| Untested edge cases | Null handling, boundary values, error paths | Mutation testing, branch coverage thresholds, property-based testing | Require a test for every bug fix. Adopt property-based testing for logic with many input permutations. Boundary value analysis as a standard practice. |
| Unintended side effects | Change to module A breaks module B | Mutation testing, change impact analysis | Small focused commits. Trunk-based development (integrate daily so side effects surface immediately). Modular design with clear boundaries. |
| Accumulated complexity | Defects cluster in the most complex, most-changed files | Complexity trends, duplication scoring, dependency cycle detection | Refactoring as part of every story, not deferred to a “tech debt sprint.” Dedicated complexity budget. |
| Long-lived branches | Merge conflicts, integration failures, stale code | Branch age alerts, merge conflict frequency | Trunk-based development. Merge at least daily. CI rejects stale branches. |
| Configuration drift | Works in staging, fails in production | IaC drift detection, environment comparison, smoke tests | All infrastructure as code. Same provisioning for every environment. Immutable infrastructure. |
| Data assumptions | Null pointer exceptions, schema migration failures | Null safety static analysis, schema compatibility checks, migration dry-runs | Enforce null-safe types. Expand-then-contract for all schema changes. |
Build a defect feedback loop
Knowing the categories is not enough. You need a process that systematically connects test
failures to root causes and root causes to systemic fixes.
Step 1: Classify every defect. When a test fails or a bug is reported, tag it with its origin
category from the table above. This takes seconds and builds a dataset over time.
Step 2: Look for patterns. Monthly (or during retrospectives), review the defect
classifications. Which categories appear most often? That is where your process is weakest.
Step 3: Apply the systemic fix, not just the local fix. When you fix a bug, also ask: what
systemic change would prevent this entire category of bug? If most defects come from integration
boundaries, the fix is not “write more integration tests” - it is “make contract tests mandatory
for every new boundary.” If most defects come from untested edge cases, the fix is not “increase
code coverage” - it is “adopt property-based testing as a standard practice.”
Step 4: Measure whether the fix works. Track defect counts by category over time. If you
applied a systemic fix for integration boundary defects and the count does not drop, the fix is
not working and you need a different approach.
The test-for-every-bug-fix rule
One of the most effective systemic practices: every bug fix must include a test that
reproduces the bug before the fix and passes after. This is non-negotiable for CD because:
- It proves the fix actually addresses the defect (not just the symptom).
- It prevents the same defect from recurring.
- It builds test coverage exactly where the codebase is weakest - the places where bugs actually
occur.
- Over time, it shifts your test suite from “tests we thought to write” to “tests that cover
real failure modes.”
Advanced detection techniques
As your test architecture matures, add techniques that find defects humans overlook:
| Technique | What It Finds | When to Adopt |
|---|---|---|
| Mutation testing (Stryker, PIT) | Tests that pass but do not actually verify behavior - your test suite’s blind spots | When basic coverage is in place but defect escape rate is not dropping |
| Property-based testing | Edge cases and boundary conditions across large input spaces that example-based tests miss | When defects cluster around unexpected input combinations |
| Chaos engineering | Failure modes in distributed systems - what happens when a dependency is slow, returns errors, or disappears | When you have functional tests and contract tests in place and need confidence in failure handling |
| Static analysis and linting | Null safety violations, type errors, security vulnerabilities, dead code | From day one - these are cheap and fast |
For more examples of mapping defect origins to detection methods and systemic corrections, see
the CD Defect Detection and Remediation Patterns.
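Of the techniques above, property-based testing is often the easiest to try first. A minimal sketch with the Hypothesis library, using a hypothetical normalization function: instead of a few hand-picked examples, you assert a property over many generated inputs.

```python
from hypothesis import given, strategies as st


def normalize_email(value: str) -> str:
    return value.strip().lower()


@given(st.emails())
def test_normalize_email_is_idempotent(email):
    # Property: normalizing twice gives the same result as normalizing once.
    once = normalize_email(email)
    assert normalize_email(once) == once


@given(st.text())
def test_normalize_email_never_raises(value):
    # Property: normalization handles any input string without crashing.
    assert isinstance(normalize_email(value), str)
```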
Measuring Success
| Metric | Target | Why It Matters |
|---|---|---|
| Deterministic suite duration | < 10 minutes | Enables fast feedback loops |
| Flaky test count | 0 in pipeline-gating suite | Maintains trust in test results |
| External dependencies in gating tests | 0 | Ensures deployment independence |
| Test coverage trend | Increasing | Confirms new code is being tested |
| Defect escape rate | Decreasing | Confirms tests catch real bugs |
| Contract test freshness | All passing within last 24 hours | Confirms test doubles are current |
Next Step
With a reliable test suite in place, automate your build process so that building, testing, and
packaging happens with a single command. Continue to Build Automation.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0. Additional concepts
drawn from Ham Vocke,
The Practical Test Pyramid,
and Toby Clemson,
Testing Strategies in a Microservice Architecture.
3.2.3 - Build Automation
Automate your build process so a single command builds, tests, and packages your application.
Phase 1 - Foundations | Adapted from Dojo Consortium
Build automation is the mechanism that turns trunk-based development and testing into a continuous integration loop. If you cannot build, test, and package your application with a single command, you cannot automate your pipeline. This page covers the practices that make your build reproducible, fast, and trustworthy.
What Build Automation Means
Build automation is the practice of scripting every step required to go from source code to a deployable artifact. A single command - or a single CI trigger - should execute the entire sequence:
- Compile the source code (if applicable)
- Run all automated tests
- Package the application into a deployable artifact (container image, binary, archive)
- Report the result (pass or fail, with details)
No manual steps. No “run this script, then do that.” No tribal knowledge about which flags to set or which order to run things. One command, every time, same result.
The Litmus Test
Ask yourself: “Can a new team member clone the repository and produce a deployable artifact with a single command within 15 minutes?”
If the answer is no, your build is not fully automated.
Why Build Automation Matters for CD
| CD Requirement | How Build Automation Supports It |
|---|---|
| Reproducibility | The same commit always produces the same artifact, on any machine |
| Speed | Automated builds can be optimized, cached, and parallelized |
| Confidence | If the build passes, the artifact is trustworthy |
| Developer experience | Developers run the same build locally that CI runs, eliminating “works on my machine” |
| Pipeline foundation | The CI/CD pipeline is just the build running automatically on every commit |
Without build automation, every other practice in this guide breaks down. You cannot have continuous integration if the build requires manual intervention. You cannot have a deterministic pipeline if the build produces different results depending on who runs it.
Key Practices
1. Version-Controlled Build Scripts
Your build configuration lives in the same repository as your code. It is versioned, reviewed, and tested alongside the application.
What belongs in version control:
- Build scripts (Makefile, build.gradle, package.json scripts, Dockerfile)
- Dependency manifests (requirements.txt, go.mod, pom.xml, package-lock.json)
- CI/CD pipeline definitions (.github/workflows, .gitlab-ci.yml, Jenkinsfile)
- Environment setup scripts (docker-compose.yml for local development)
What does not belong in version control:
- Secrets and credentials (use secret management tools)
- Environment-specific configuration values (use environment variables or config management)
- Generated artifacts (build outputs, compiled binaries)
Anti-pattern: Build instructions that exist only in a wiki, a Confluence page, or one developer’s head. If the build steps are not in the repository, they will drift from reality.
2. Dependency Management
All dependencies must be declared explicitly and resolved deterministically.
Practices:
- Lock files: Use lock files (package-lock.json, Pipfile.lock, go.sum) to pin exact dependency versions. Check lock files into version control.
- Reproducible resolution: Running the dependency install twice should produce identical results.
- No undeclared dependencies: Your build should not rely on tools or libraries that happen to be installed on the build machine. If you need it, declare it.
- Dependency scanning: Automate vulnerability scanning of dependencies as part of the build. Do not wait for a separate security review.
Anti-pattern: “It builds on Jenkins because Jenkins has Java 11 installed, but the Dockerfile uses Java 17.” The build must declare and control its own runtime.
3. Build Caching
Fast builds keep developers in flow. Caching is the primary mechanism for build speed.
What to cache:
- Dependencies: Download once, reuse across builds. Most build tools (npm, Maven, Gradle, pip) support a local cache.
- Compilation outputs: Incremental compilation avoids rebuilding unchanged modules.
- Docker layers: Structure your Dockerfile so that rarely-changing layers (OS, dependencies) are cached and only the application code layer is rebuilt.
- Test fixtures: Prebuilt test data or container images used by tests.
Guidelines:
- Cache aggressively for local development and CI
- Invalidate caches when dependencies or build configuration change
- Do not cache test results - tests must always run
4. Single Build Script Entry Point
Developers, CI, and CD should all use the same entry point.
The CI server runs make all. A developer runs make all. The result is the same. There is no separate “CI build script” that diverges from what developers run locally.
5. Artifact Versioning
Every build artifact must be traceable to the exact commit that produced it.
Practices:
- Tag artifacts with the Git commit SHA or a build number derived from it
- Store build metadata (commit, branch, timestamp, builder) in the artifact or alongside it
- Never overwrite an existing artifact - if the version exists, the artifact is immutable
This becomes critical in Phase 2 when you establish immutable artifact practices.
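As a sketch of the idea (not a prescribed tool), a small Python helper can derive an immutable artifact tag from the current commit; it assumes git is available on the build machine, and the registry and service names are placeholders.

```python
import subprocess


def current_commit_sha(short: bool = True) -> str:
    # git rev-parse returns the commit the build was produced from.
    args = ["git", "rev-parse", "--short", "HEAD"] if short else ["git", "rev-parse", "HEAD"]
    return subprocess.run(args, capture_output=True, text=True, check=True).stdout.strip()


def artifact_tag(app_name: str) -> str:
    # e.g. "registry.example.com/checkout-service:3f2a1bc" - never reused, never overwritten.
    return f"registry.example.com/{app_name}:{current_commit_sha()}"


if __name__ == "__main__":
    print(artifact_tag("checkout-service"))
```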
CI Server Setup Basics
The CI server is the mechanism that runs your build automatically. In Phase 1, the setup is straightforward:
What the CI Server Does
- Watches the trunk for new commits
- Runs the build (the same command a developer would run locally)
- Reports the result (pass/fail, test results, build duration)
- Notifies the team if the build fails
Minimum CI Configuration
Regardless of which CI tool you use (GitHub Actions, GitLab CI, Jenkins, CircleCI), the configuration follows the same pattern: trigger on every commit to trunk, check out the code, run the single build command that developers use locally, and report the result to the team.
CI Principles for Phase 1
- Run on every commit. Not nightly, not weekly, not “when someone remembers.” Every commit to trunk triggers a build.
- Keep the build green. A failing build is the team’s top priority. Work stops until trunk is green again. (See Working Agreements.)
- Run the same build everywhere. The CI server runs the same script as local development. No CI-only steps that developers cannot reproduce.
- Fail fast. Run the fastest checks first (compilation, unit tests) before the slower ones (integration tests, packaging).
Build Time Targets
Build speed directly affects developer productivity and integration frequency. If the build takes 30 minutes, developers will not integrate multiple times per day.
| Build Phase | Target | Rationale |
|---|---|---|
| Compilation | < 1 minute | Developers need instant feedback on syntax and type errors |
| Unit tests | < 3 minutes | Fast enough to run before every commit |
| Integration tests | < 5 minutes | Must complete before the developer context-switches |
| Full build (compile + test + package) | < 10 minutes | The outer bound for fast feedback |
If Your Build Is Too Slow
Slow builds are a common constraint that blocks CD adoption. Address them systematically:
- Profile the build. Identify which steps take the most time. Optimize the bottleneck, not everything.
- Parallelize tests. Most test frameworks support parallel execution. Run independent test suites concurrently.
- Use build caching. Avoid recompiling or re-downloading unchanged dependencies.
- Split the build. Run fast checks (lint, compile, unit tests) as a “fast feedback” stage. Run slower checks (integration tests, security scans) as a second stage.
- Upgrade build hardware. Sometimes the fastest optimization is more CPU and RAM.
The target is under 10 minutes for the feedback loop that developers use on every commit. Longer-running validation (E2E tests, performance tests) can run in a separate stage.
Common Anti-Patterns
Manual Build Steps
Symptom: The build process includes steps like “open this tool and click Run” or “SSH into the build server and execute this script.”
Problem: Manual steps are error-prone, slow, and cannot be parallelized or cached. They are the single biggest obstacle to build automation.
Fix: Script every step. If a human must perform the step today, write a script that performs it tomorrow.
Environment-Specific Builds
Symptom: The build produces different artifacts for different environments (dev, staging, production). Or the build only works on specific machines because of pre-installed tools.
Problem: Environment-specific builds mean you are not testing the same artifact you deploy. Bugs that appear in production but not in staging become impossible to diagnose.
Fix: Build one artifact and configure it per environment at deployment time. The artifact is immutable; the configuration is external. (See Application Config in Phase 2.)
Build Scripts That Only Run in CI
Symptom: The CI pipeline has build steps that developers cannot run locally. Local development uses a different build process.
Problem: Developers cannot reproduce CI failures locally, leading to slow debugging cycles and “push and pray” development.
Fix: Use a single build entry point (Makefile, build script) that both CI and developers use. CI configuration should only add triggers and notifications, not build logic.
Missing Dependency Pinning
Symptom: Builds break randomly because a dependency released a new version overnight.
Problem: Without pinned dependencies, the build is non-deterministic. The same code can produce different results on different days.
Fix: Use lock files. Pin all dependency versions. Update dependencies intentionally, not accidentally.
Long Build Queues
Symptom: Developers commit to trunk, but the build does not run for 20 minutes because the CI server is processing a queue.
Problem: Delayed feedback defeats the purpose of CI. If developers do not see the result of their commit for 30 minutes, they have already moved on.
Fix: Ensure your CI infrastructure can handle your team’s commit frequency. Use parallel build agents. Prioritize builds on the main branch.
Measuring Success
| Metric | Target | Why It Matters |
|---|---|---|
| Build duration | < 10 minutes | Enables fast feedback and frequent integration |
| Build success rate | > 95% | Indicates reliable, reproducible builds |
| Time from commit to build result | < 15 minutes (including queue time) | Measures the full feedback loop |
| Developer ability to build locally | 100% of team | Confirms the build is portable and documented |
Next Step
With build automation in place, you can build, test, and package your application reliably. The next foundation is ensuring that the work you integrate daily is small enough to be safe. Continue to Work Decomposition.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
3.2.4 - Work Decomposition
Break features into small, deliverable increments that can be completed in 2 days or less.
Phase 1 - Foundations | Adapted from Dojo Consortium
Trunk-based development requires daily integration, and daily integration requires small work. If a feature takes two weeks to build, you cannot integrate it daily without decomposing it first. This page covers the techniques for breaking work into small, deliverable increments that flow through your pipeline continuously.
Why Small Work Matters for CD
Continuous delivery depends on a simple equation: small changes, integrated frequently, are safer than large changes integrated rarely.
Every practice in Phase 1 reinforces this:
- Trunk-based development requires that you integrate at least daily. You cannot integrate a two-week feature daily unless you decompose it.
- Testing fundamentals work best when each change is small enough to test thoroughly.
- Code review is fast when the change is small. A 50-line change can be reviewed in minutes. A 2,000-line change takes hours - if it gets reviewed at all.
The data supports this. The DORA research consistently shows that smaller batch sizes correlate with higher delivery performance. Small changes have:
- Lower risk: If a small change breaks something, the blast radius is limited, and the cause is obvious.
- Faster feedback: A small change gets through the pipeline quickly. You learn whether it works today, not next week.
- Easier rollback: Rolling back a 50-line change is straightforward. Rolling back a 2,000-line change often requires a new deployment.
- Better flow: Small work items move through the system predictably. Large work items block queues and create bottlenecks.
The 2-Day Rule
If a work item takes longer than 2 days to complete, it is too big.
This is not arbitrary. Two days gives you at least one integration to trunk per day (the minimum for TBD) and allows for the natural rhythm of development: plan, implement, test, integrate, move on.
When a developer says “this will take a week,” the answer is not “go faster.” The answer is “break it into smaller pieces.”
What “Complete” Means
A work item is complete when it is:
- Integrated to trunk
- All tests pass
- The change is deployable (even if the feature is not yet user-visible)
- It meets the Definition of Done
If a story requires a feature flag to hide incomplete user-facing behavior, that is fine. The code is still integrated, tested, and deployable.
Story Slicing Techniques
Story slicing is the practice of breaking user stories into the smallest possible increments that still deliver value or make progress toward delivering value.
The INVEST Criteria
Good stories follow INVEST:
| Criterion | Meaning | Why It Matters for CD |
|---|---|---|
| Independent | Can be developed and deployed without waiting for other stories | Enables parallel work and avoids blocking |
| Negotiable | Details can be discussed and adjusted | Allows the team to find the smallest valuable slice |
| Valuable | Delivers something meaningful to the user or the system | Prevents “technical stories” that do not move the product forward |
| Estimable | Small enough that the team can reasonably estimate it | Large stories are unestimable because they hide unknowns |
| Small | Completable within 2 days | Enables daily integration and fast feedback |
| Testable | Has clear acceptance criteria that can be automated | Supports the testing foundation |
Vertical Slicing
The most important slicing technique for CD is vertical slicing: cutting through all layers of the application to deliver a thin but complete slice of functionality.
Vertical slice (correct):
“As a user, I can log in with my email and password.”
This slice touches the UI (login form), the API (authentication endpoint), and the database (user lookup). It is deployable and testable end-to-end.
Horizontal slice (anti-pattern):
“Build the database schema for user accounts.”
“Build the authentication API.”
“Build the login form UI.”
Each horizontal slice is incomplete on its own. None is deployable. None is testable end-to-end. They create dependencies between work items and block flow.
Slicing Strategies
When a story feels too big, apply one of these strategies:
| Strategy | How It Works | Example |
|---|---|---|
| By workflow step | Implement one step of a multi-step process | “User can add items to cart” (before “user can checkout”) |
| By business rule | Implement one rule at a time | “Orders over $100 get free shipping” (before “orders ship to international addresses”) |
| By data variation | Handle one data type first | “Support credit card payments” (before “support PayPal”) |
| By operation | Implement CRUD operations separately | “Create a new customer” (before “edit customer” or “delete customer”) |
| By performance | Get it working first, optimize later | “Search returns results” (before “search returns results in under 200ms”) |
| By platform | Support one platform first | “Works on desktop web” (before “works on mobile”) |
| Happy path first | Implement the success case first | “User completes checkout” (before “user sees error when payment fails”) |
Example: Decomposing a Feature
Original story (too big):
“As a user, I can manage my profile including name, email, avatar, password, notification preferences, and two-factor authentication.”
Decomposed into vertical slices:
- “User can view their current profile information” (read-only display)
- “User can update their name” (simplest edit)
- “User can update their email with verification” (adds email flow)
- “User can upload an avatar image” (adds file handling)
- “User can change their password” (adds security validation)
- “User can configure notification preferences” (adds preferences)
- “User can enable two-factor authentication” (adds 2FA flow)
Each slice is independently deployable, testable, and completable within 2 days. Each delivers incremental value. The feature is built up over a series of small deliveries rather than one large batch.
Behavior-Driven Development (BDD) is not just a testing practice - it is a powerful tool for decomposing work into small, clear increments.
Three Amigos
Before work begins, hold a brief “Three Amigos” session with three perspectives:
- Business/Product: What should this feature do? What is the expected behavior?
- Development: How will we build it? What are the technical considerations?
- Testing: How will we verify it? What are the edge cases?
This 15-30 minute conversation accomplishes two things:
- Shared understanding: Everyone agrees on what “done” looks like before work begins.
- Natural decomposition: Discussing specific scenarios reveals natural slice boundaries.
Specification by Example
Write acceptance criteria as concrete examples, not abstract requirements.
Abstract (hard to slice):
“The system should validate user input.”
Concrete (easy to slice):
- Given an email field, when the user enters “not-an-email”, then the form shows “Please enter a valid email address.”
- Given a password field, when the user enters fewer than 8 characters, then the form shows “Password must be at least 8 characters.”
- Given a name field, when the user leaves it blank, then the form shows “Name is required.”
Each concrete example can become its own story or task. The scope is clear, the acceptance criteria are testable, and the work is small.
Structure acceptance criteria in Given-When-Then format to make them executable:
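For example, one of the concrete password scenarios above can become a plain executable test with the Given/When/Then structure kept visible in comments. Tools such as behave or pytest-bdd can map real Gherkin files to step functions, but the SignupForm below is just a hypothetical in-memory stand-in to keep the sketch self-contained.

```python
class SignupForm:
    """Minimal stand-in for the real form under test."""

    def __init__(self):
        self.errors = []

    def submit(self, email: str, password: str):
        if len(password) < 8:
            self.errors.append("Password must be at least 8 characters.")


def test_short_password_shows_validation_message():
    # Given a signup form with a password field
    form = SignupForm()

    # When the user enters fewer than 8 characters
    form.submit(email="dev@example.com", password="short")

    # Then the form shows the validation message
    assert "Password must be at least 8 characters." in form.errors
```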
Each scenario is a natural unit of work. Implement one scenario at a time, integrate to trunk after each one.
Task Decomposition Within Stories
Even well-sliced stories may contain multiple tasks. Decompose stories into tasks that can be completed and integrated independently.
Example story: “User can update their name”
Tasks:
- Add the name field to the profile API endpoint (backend change, integration test)
- Add the name field to the profile form (frontend change, unit test)
- Connect the form to the API endpoint (integration, E2E test)
Each task results in a commit to trunk. The story is completed through a series of small integrations, not one large merge.
Guidelines for task decomposition:
- Each task should take hours, not days
- Each task should leave trunk in a working state after integration
- Tasks should be ordered so that the simplest changes come first
- If a task requires a feature flag or stub to be integrated safely, that is fine
Common Anti-Patterns
Horizontal Slicing
Symptom: Stories are organized by architectural layer: “build the database schema,” “build the API,” “build the UI.”
Problem: No individual slice is deployable or testable end-to-end. Integration happens at the end, which is where bugs are found and schedules slip.
Fix: Slice vertically. Every story should touch all the layers needed to deliver a thin slice of complete functionality.
Technical Stories
Symptom: The backlog contains stories like “refactor the database access layer” or “upgrade to React 18” that do not deliver user-visible value.
Problem: Technical work is important, but when it is separated from feature work, it becomes hard to prioritize and easy to defer. It also creates large, risky changes.
Fix: Embed technical improvements in feature stories. Refactor as you go. If a technical change is necessary, tie it to a specific business outcome and keep it small enough to complete in 2 days.
Stories That Are Really Epics
Symptom: A story has 10+ acceptance criteria, or the estimate is “8 points” or “2 weeks.”
Problem: Large stories hide unknowns, resist estimation, and cannot be integrated daily.
Fix: If a story has more than 3-5 acceptance criteria, it is an epic. Break it into smaller stories using the slicing strategies above.
Splitting by Role Instead of by Behavior
Symptom: Separate stories for “frontend developer builds the UI” and “backend developer builds the API.”
Problem: This creates handoff dependencies and delays integration. The feature is not testable until both stories are complete.
Fix: Write stories from the user’s perspective. The same developer (or pair) implements the full vertical slice.
Deferring “Edge Cases” Indefinitely
Symptom: The team builds the happy path and creates a backlog of “handle error case X” stories that never get prioritized.
Problem: Error handling is not optional. Unhandled edge cases become production incidents.
Fix: Include the most important error cases in the initial story decomposition. Use the “happy path first” slicing strategy, but schedule edge case stories immediately after, not “someday.”
Measuring Success
| Metric | Target | Why It Matters |
|---|---|---|
| Story cycle time | < 2 days from start to trunk | Confirms stories are small enough |
| Development cycle time | Decreasing | Shows improved flow from smaller work |
| Stories completed per week | Increasing (with same team size) | Indicates better decomposition and less rework |
| Work in progress | Decreasing | Fewer large stories blocking the pipeline |
Next Step
Small, well-decomposed work flows through the system quickly - but only if code review does not become a bottleneck. Continue to Code Review to learn how to keep review fast and effective.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
3.2.5 - Code Review
Streamline code review to provide fast feedback without blocking flow.
Phase 1 - Foundations | Adapted from Dojo Consortium
Code review is essential for quality, but it is also the most common bottleneck in teams adopting trunk-based development. If reviews take days, daily integration is impossible. This page covers review techniques that maintain quality while enabling the flow that CD requires.
Why Code Review Matters for CD
Code review serves multiple purposes:
- Defect detection: A second pair of eyes catches bugs that the author missed.
- Knowledge sharing: Reviews spread understanding of the codebase across the team.
- Consistency: Reviews enforce coding standards and architectural patterns.
- Mentoring: Junior developers learn by having their code reviewed and by reviewing others’ code.
These are real benefits. The challenge is that traditional code review - open a pull request, wait for someone to review it, address comments, wait again - is too slow for CD.
In a CD workflow, code review must happen within minutes or hours, not days. The review is still rigorous, but the process is designed for speed.
The Core Tension: Quality vs. Flow
Traditional teams optimize review for thoroughness: detailed comments, multiple reviewers, extensive back-and-forth. This produces high-quality reviews but blocks flow.
CD teams optimize review for speed without sacrificing the quality that matters. The key insight is that most of the quality benefit of code review comes from small, focused reviews done quickly, not from exhaustive reviews done slowly.
| Traditional Review | CD-Compatible Review |
| --- | --- |
| Review happens after the feature is complete | Review happens continuously throughout development |
| Large diffs (hundreds or thousands of lines) | Small diffs (< 200 lines, ideally < 50) |
| Multiple rounds of feedback and revision | One round, or real-time feedback during pairing |
| Review takes 1-3 days | Review takes minutes to a few hours |
| Review is asynchronous by default | Review is synchronous by preference |
| 2+ reviewers required | 1 reviewer (or pairing as the review) |
Synchronous vs. Asynchronous Review
Synchronous Review (Preferred for CD)
In synchronous review, the reviewer and author are engaged at the same time. Feedback is immediate. Questions are answered in real time. The review is done when the conversation ends.
Methods:
- Pair programming: Two developers work on the same code at the same time. Review is continuous. There is no separate review step because the code was reviewed as it was written.
- Mob programming: The entire team (or a subset) works on the same code together. Everyone reviews in real time.
- Over-the-shoulder review: The author walks the reviewer through the change in person or on a video call. The reviewer asks questions and provides feedback immediately.
Advantages for CD:
- Zero wait time between “ready for review” and “review complete”
- Higher bandwidth communication (tone, context, visual cues) catches more issues
- Immediate resolution of questions - no async back-and-forth
- Knowledge transfer happens naturally through the shared work
Asynchronous Review (When Necessary)
Sometimes synchronous review is not possible - time zones, schedules, or team preferences may require asynchronous review. This is fine, but it must be fast.
Rules for async review in a CD workflow:
- Review within 2 hours. If a pull request sits for a day, it blocks integration. Set a team working agreement: “pull requests are reviewed within 2 hours during working hours.”
- Keep changes small. A 50-line change can be reviewed in 5 minutes. A 500-line change takes an hour and reviewers procrastinate on it.
- Use draft PRs for early feedback. If you want feedback on an approach before the code is complete, open a draft PR. Do not wait until the change is “perfect.”
- Avoid back-and-forth. If a comment requires discussion, move to a synchronous channel (call, chat). Async comment threads that go 5 rounds deep are a sign the change is too large or the design was not discussed upfront.
Review Techniques Compatible with TBD
Pair Programming as Review
When two developers pair on a change, the code is reviewed as it is written. There is no separate review step, no pull request waiting for approval, and no delay to integration.
How it works with TBD:
- Two developers sit together (physically or via screen share)
- They discuss the approach, write the code, and review each other’s decisions in real time
- When the change is ready, they commit to trunk together
- Both developers are accountable for the quality of the code
When to pair:
- New or unfamiliar areas of the codebase
- Changes that affect critical paths
- When a junior developer is working on a change (pairing doubles as mentoring)
- Any time the change involves design decisions that benefit from discussion
Pair programming satisfies most organizations’ code review requirements because two developers have actively reviewed and approved the code.
Mob Programming as Review
Mob programming extends pairing to the whole team. One person drives (types), one person navigates (directs), and the rest observe and contribute.
When to mob:
- Establishing new patterns or architectural decisions
- Complex changes that benefit from multiple perspectives
- Onboarding new team members to the codebase
- Working through particularly difficult problems
Mob programming is intensive but highly effective. Every team member understands the code, the design decisions, and the trade-offs.
Rapid Async Review
For teams that use pull requests, rapid async review adapts the pull request workflow for CD speed.
Practices:
- Auto-assign reviewers. Do not wait for someone to volunteer. Use tools to automatically assign a reviewer when a PR is opened.
- Keep PRs small. Target < 200 lines of changed code. Smaller PRs get reviewed faster and more thoroughly.
- Provide context. Write a clear PR description that explains what the change does, why it is needed, and how to verify it. A good description reduces review time dramatically.
- Use automated checks. Run linting, formatting, and tests before the human review. The reviewer should focus on logic and design, not style.
- Approve and merge quickly. If the change looks correct, approve it. Do not hold it for nitpicks. Nitpicks can be addressed in a follow-up commit.
What to Review
Not everything in a code change deserves the same level of scrutiny. Focus reviewer attention where it matters most.
High Priority (Reviewer Should Focus Here)
- Behavior correctness: Does the code do what it is supposed to do? Are edge cases handled?
- Security: Does the change introduce vulnerabilities? Are inputs validated? Are secrets handled properly?
- Clarity: Can another developer understand this code in 6 months? Are names clear? Is the logic straightforward?
- Test coverage: Are the new behaviors tested? Do the tests verify the right things?
- API contracts: Do changes to public interfaces maintain backward compatibility? Are they documented?
- Error handling: What happens when things go wrong? Are errors caught, logged, and surfaced appropriately?
Low Priority (Automate Instead of Reviewing)
- Code style and formatting: Use automated formatters (Prettier, Black, gofmt). Do not waste reviewer time on indentation and bracket placement.
- Import ordering: Automate with linting rules.
- Naming conventions: Enforce with lint rules where possible. Only flag naming in review if it genuinely harms readability.
- Unused variables or imports: Static analysis tools catch these instantly.
- Consistent patterns: Where possible, encode patterns in architecture decision records and lint rules rather than relying on reviewers to catch deviations.
Rule of thumb: If a style or convention issue can be caught by a machine, do not ask a human to catch it. Reserve human attention for the things machines cannot evaluate: correctness, design, clarity, and security.
Review Scope for Small Changes
In a CD workflow, most changes are small - tens of lines, not hundreds. This changes the economics of review.
| Change Size | Expected Review Time | Review Depth |
| --- | --- | --- |
| < 20 lines | 2-5 minutes | Quick scan: is it correct? Any security issues? |
| 20-100 lines | 5-15 minutes | Full review: behavior, tests, clarity |
| 100-200 lines | 15-30 minutes | Detailed review: design, contracts, edge cases |
| > 200 lines | Consider splitting the change | Large changes get superficial reviews |
Research consistently shows that reviewer effectiveness drops sharply after 200-400 lines. If you are regularly reviewing changes larger than 200 lines, the problem is not the review process - it is the work decomposition.
Working Agreements for Review SLAs
Establish clear team agreements about review expectations. Without explicit agreements, review latency will drift based on individual habits.
Recommended Review Agreements
| Agreement | Target |
| --- | --- |
| Response time | Review within 2 hours during working hours |
| Reviewer count | 1 reviewer (or pairing as the review) |
| PR size | < 200 lines of changed code |
| Blocking issues only | Only block a merge for correctness, security, or significant design issues |
| Nitpicks | Use a “nit:” prefix. Nitpicks are suggestions, not merge blockers |
| Stale PRs | PRs open for > 24 hours are escalated to the team |
| Self-review | Author reviews their own diff before requesting review |
How to Enforce Review SLAs
- Track review turnaround time. If it consistently exceeds 2 hours, discuss it in retrospectives.
- Make review a first-class responsibility, not something developers do “when they have time.”
- If a reviewer is unavailable, any other team member can review. Do not create single-reviewer dependencies.
- Consider pairing as the default and async review as the exception. This eliminates the review bottleneck entirely.
Code Review and Trunk-Based Development
Code review and TBD work together, but only if review does not block integration. Here is how to reconcile them:
| TBD Requirement | How Review Adapts |
| --- | --- |
| Integrate to trunk at least daily | Reviews must complete within hours, not days |
| Branches live < 24 hours | PRs are opened and merged within the same day |
| Trunk is always releasable | Reviewers focus on correctness, not perfection |
| Small, frequent changes | Small changes are reviewed quickly and thoroughly |
If your team finds that review is the bottleneck preventing daily integration, the most effective solution is to adopt pair programming. It eliminates the review step entirely by making review continuous.
Measuring Success
| Metric | Target | Why It Matters |
| --- | --- | --- |
| Review turnaround time | < 2 hours | Prevents review from blocking integration |
| PR size (lines changed) | < 200 lines | Smaller PRs get faster, more thorough reviews |
| PR age at merge | < 24 hours | Aligns with TBD branch age constraint |
| Review rework cycles | < 2 rounds | Multiple rounds indicate the change is too large or the design was not discussed upfront |
Next Step
Code review practices need to be codified in team agreements alongside other shared commitments. Continue to Working Agreements to establish your team’s definitions of done, ready, and CI practice.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
3.2.6 - Working Agreements
Establish shared definitions of done and ready to align the team on quality and process.
Phase 1 - Foundations | Adapted from Dojo Consortium
The practices in Phase 1 - trunk-based development, testing, small work, and fast review - only work when the whole team commits to them. Working agreements make that commitment explicit. This page covers the key agreements a team needs before moving to pipeline automation in Phase 2.
Why Working Agreements Matter
A working agreement is a shared commitment that the team creates, owns, and enforces together. It is not a policy imposed from outside. It is the team’s own answer to the question: “How do we work together?”
Without working agreements, CD practices drift. One developer integrates daily; another keeps a branch for a week. One developer fixes a broken build immediately; another waits until after lunch. These inconsistencies compound. Within weeks, the team is no longer practicing CD - they are practicing individual preferences.
Working agreements prevent this drift by making expectations explicit. When everyone agrees on what “done” means, what “ready” means, and how CI works, the team can hold each other accountable without conflict.
Definition of Done
The Definition of Done (DoD) is the team’s shared standard for when a work item is complete. For CD, the Definition of Done must include deployment.
Minimum Definition of Done for CD
A work item is done when all of the following are true:
- The code is integrated to trunk
- All automated tests pass
- The change has been reviewed (or was pair programmed)
- The change is deployable to production
Why “Deployed to Production” Matters
Many teams define “done” as “code is merged.” This creates a gap between “done” and “delivered.” Work accumulates in a staging environment, waiting for a release. Risk grows with each unreleased change.
In a CD organization, “done” means the change is in production (or ready to be deployed to production at any time). This is the ultimate test of completeness: the change works in the real environment, with real data, under real load.
In Phase 1, you may not yet have the pipeline to deploy every change to production automatically. That is fine - your DoD should still include “deployable to production” as the standard, even if the deployment step is not yet automated. The pipeline work in Phase 2 will close that gap.
Extending Your Definition of Done
As your CD maturity grows, extend the DoD:
| Phase | Addition to DoD |
| --- | --- |
| Phase 1 (Foundations) | Code integrated to trunk, tests pass, reviewed, deployable |
| Phase 2 (Pipeline) | Artifact built and validated by the pipeline |
| Phase 3 (Optimize) | Change deployed to production behind a feature flag |
| Phase 4 (Deliver on Demand) | Change deployed to production and monitored |
Definition of Ready
The Definition of Ready (DoR) answers: “When is a work item ready to be worked on?” Pulling unready work into development creates waste - unclear requirements lead to rework, missing acceptance criteria lead to untestable changes, and oversized stories lead to long-lived branches.
Minimum Definition of Ready for CD
A work item is ready when all of the following are true:
- Acceptance criteria are defined and testable
- The item is small enough to complete in less than two days
- The team has discussed the item and agrees on what it means
Common Mistakes with Definition of Ready
- Making it too rigid. The DoR is a guideline, not a gate. If the team agrees a work item is understood well enough, it is ready. Do not use the DoR to avoid starting work.
- Requiring design documents. For small work items (< 2 days), a conversation and acceptance criteria are sufficient. Formal design documents are for larger initiatives.
- Skipping the conversation. The DoR is most valuable as a prompt for discussion, not as a checklist. The Three Amigos conversation matters more than the checkboxes.
CI Working Agreement
The CI working agreement codifies how the team practices continuous integration. This is the most operationally critical working agreement for CD.
The CI Agreement
The team agrees to the following practices:
Integration:
- Every developer integrates work to trunk at least once per day
Build:
- Every integration triggers the automated build and test suite
Broken builds:
- A broken build is the team’s top priority
- If the build cannot be fixed within 10 minutes, the offending commit is reverted
Work in progress:
- Finish and integrate current work before starting new work
Why “Broken Build = Top Priority”
This is the single most important CI agreement. When the build is broken:
- No one can integrate safely. Changes are stacking up.
- Trunk is not releasable. The team has lost its safety net.
- Every minute the build stays broken, the team accumulates risk.
“Fix the build” is not a suggestion. It is an agreement that the team enforces collectively. If the build is broken and someone starts a new feature instead of fixing it, the team should call that out. This is not punitive - it is the team protecting its own ability to deliver.
The Revert Rule
If a broken build cannot be fixed within 10 minutes, revert the offending commit and fix the issue on a branch. This keeps trunk green and unblocks the rest of the team. The developer who made the change is not being punished - they are protecting the team’s flow.
Reverting feels uncomfortable at first. Teams worry about “losing work.” But a reverted commit is not lost - the code is still in the Git history. The developer can re-apply their change after fixing the issue. The alternative - a broken trunk for hours while someone debugs - is far more costly.
How Working Agreements Support the CD Migration
Each working agreement maps directly to a Phase 1 practice: the Definition of Done supports a releasable trunk, the Definition of Ready supports small, well-understood work, and the CI agreement supports daily integration and fast recovery from broken builds.
Without these agreements, individual practices exist in isolation. Working agreements connect them into a coherent way of working.
Template: Create Your Own Working Agreements
Use this template as a starting point. Customize it for your team’s context. The specific targets may differ, but the structure should remain.
Team Working Agreement Template
- Definition of Done: A work item is done when…
- Definition of Ready: A work item is ready when…
- CI agreement: We integrate to trunk at least ___ times per day. A broken build is fixed or reverted within ___ minutes.
- Review agreement: Pull requests are reviewed within ___ hours and stay under ___ lines.
Tips for Creating Working Agreements
- Include everyone. Every team member should participate in creating the agreement. Agreements imposed by a manager or tech lead are policies, not agreements.
- Start simple. Do not try to cover every scenario. Start with the essentials (DoD, DoR, CI) and add specifics as the team identifies gaps.
- Make them visible. Post the agreements where the team sees them daily - on a team wiki, in the team channel, or on a physical board.
- Review regularly. Agreements should evolve as the team matures. Review them monthly. Remove agreements that are second nature. Add agreements for new challenges.
- Enforce collectively. Working agreements are only effective if the team holds each other accountable. This is a team responsibility, not a manager responsibility.
- Start with agreements you can keep. If the team is currently integrating once a week, do not agree to integrate three times daily. Agree to integrate daily, practice for a month, then tighten.
Measuring Success
| Metric | Target | Why It Matters |
| --- | --- | --- |
| Agreement adherence | Team self-reports > 80% adherence | Indicates agreements are realistic and followed |
| Agreement review frequency | Monthly | Ensures agreements stay relevant |
| Integration frequency | Meets CI agreement target | Validates the CI working agreement |
| Broken build fix time | Meets CI agreement target | Validates the broken build response agreement |
Next Step
With working agreements in place, your team has established the foundations for continuous delivery: daily integration, reliable testing, automated builds, small work, fast review, and shared commitments.
You are ready to move to Phase 2: Pipeline, where you will build the automated path from commit to production.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
3.2.7 - Everything as Code
Every artifact that defines your system - infrastructure, pipelines, configuration, database schemas, monitoring - belongs in version control and is delivered through pipelines.
Phase 1 - Foundations
If it is not in version control, it does not exist. If it is not delivered through a pipeline, it
is a manual step. Manual steps block continuous delivery. This page establishes the principle that
everything required to build, deploy, and operate your system is defined as code, version
controlled, reviewed, and delivered through the same automated pipelines as your application.
The Principle
Continuous delivery requires that any change to your system - application code, infrastructure,
pipeline configuration, database schema, monitoring rules, security policies - can be made through
a single, consistent process: change the code, commit, let the pipeline deliver it.
When something is defined as code:
- It is version controlled. You can see who changed what, when, and why. You can revert any
change. You can trace any production state to a specific commit.
- It is reviewed. Changes go through the same review process as application code. A second
pair of eyes catches mistakes before they reach production.
- It is tested. Automated validation catches errors before deployment. Linting, dry-runs,
and policy checks apply to infrastructure the same way unit tests apply to application code.
- It is reproducible. You can recreate any environment from scratch. Disaster recovery is
“re-run the pipeline,” not “find the person who knows how to configure the server.”
- It is delivered through a pipeline. No SSH, no clicking through UIs, no manual steps. The
pipeline is the only path to production for everything, not just application code.
When something is not defined as code, it is a liability. It cannot be reviewed, tested, or
reproduced. It exists only in someone’s head, a wiki page that is already outdated, or a
configuration that was applied manually and has drifted from any documented state.
What “Everything” Means
Application code
This is where most teams start, and it is the least controversial. Your application source code
is in version control, built and tested by a pipeline, and deployed as an immutable artifact.
If your application code is not in version control, start here. Nothing else in this page matters
until this is in place.
Infrastructure
Every server, network, database instance, load balancer, DNS record, and cloud resource should be
defined in code and provisioned through automation.
What this looks like:
- Cloud resources defined in Terraform, Pulumi, CloudFormation, or similar tools
- Server configuration managed by Ansible, Chef, Puppet, or container images
- Network topology, firewall rules, and security groups defined declaratively
- Environment creation is a pipeline run, not a ticket to another team
What this replaces:
- Clicking through cloud provider consoles to create resources
- SSH-ing into servers to install packages or change configuration
- Filing tickets for another team to provision an environment
- “Snowflake” servers that were configured by hand and nobody knows how to recreate
Why it matters for CD: If creating or modifying an environment requires manual steps, your
deployment frequency is limited by the availability and speed of the person who performs those
steps. If a production server fails and you cannot recreate it from code, your mean time to
recovery is measured in hours or days instead of minutes.
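To make this concrete, here is a minimal sketch of a cloud resource defined in code using Pulumi’s TypeScript SDK (one of the tools mentioned above); the resource type, names, and tags are illustrative assumptions, not a prescription.

```typescript
import * as aws from "@pulumi/aws";

// An S3 bucket defined declaratively: creating or changing it is a code review
// and a pipeline run, not a console click.
const artifactBucket = new aws.s3.Bucket("artifact-bucket", {
  acl: "private",
  tags: { team: "delivery", managedBy: "pulumi" },
});

// Exported so other stacks or pipeline steps can reference the real resource name.
export const bucketName = artifactBucket.id;
```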
Pipeline definitions
Your CI/CD pipeline configuration belongs in the same repository as the code it builds and
deploys. The pipeline is code, not a configuration applied through a UI.
What this looks like:
- Pipeline definitions in .github/workflows/, .gitlab-ci.yml, Jenkinsfile, or equivalent
- Pipeline changes go through the same review process as application code
- Pipeline behavior is deterministic - the same commit always produces the same pipeline behavior
- Teams can modify their own pipelines without filing tickets
What this replaces:
- Pipeline configuration maintained through a Jenkins UI that nobody is allowed to touch
- A “platform team” that owns all pipeline definitions and queues change requests
- Pipeline behavior that varies depending on server state or installed plugins
Why it matters for CD: The pipeline is the path to production. If the pipeline itself cannot
be changed through a reviewed, automated process, it becomes a bottleneck and a risk. Pipeline
changes should flow with the same speed and safety as application changes.
Database schemas and migrations
Database schema changes should be defined as versioned migration scripts, stored in version
control, and applied through the pipeline.
What this looks like:
- Migration scripts in the repository (using tools like Flyway, Liquibase, Alembic, or
ActiveRecord migrations)
- Every schema change is a numbered, ordered migration that can be applied and rolled back
- Migrations run as part of the deployment pipeline, not as a manual step
- Schema changes follow the expand-then-contract pattern: add the new column, deploy code that
uses it, then remove the old column in a later migration
What this replaces:
- A DBA manually applying SQL scripts during a maintenance window
- Schema changes that are “just done in production” and not tracked anywhere
- Database state that has drifted from what is defined in any migration script
Why it matters for CD: Database changes are one of the most common reasons teams cannot deploy
continuously. If schema changes require manual intervention, coordinated downtime, or a separate
approval process, they become a bottleneck that forces batching. Treating schemas as code with
automated migrations removes this bottleneck.
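As an illustration, here is a minimal sketch of a versioned, reversible migration written for a Knex-style TypeScript migration tool (an assumption for the example; Flyway, Liquibase, Alembic, and ActiveRecord follow the same idea). It shows only the expand step of expand-then-contract, and the table and column names are hypothetical.

```typescript
import type { Knex } from "knex";

// Expand step: add the new column without touching the old one.
// Code that uses the column ships next; the contract step (dropping the
// old column) is a later, separate migration.
export async function up(knex: Knex): Promise<void> {
  await knex.schema.alterTable("users", (table) => {
    table.string("display_name").nullable();
  });
}

// Every migration is reversible, so rollback is also a pipeline operation.
export async function down(knex: Knex): Promise<void> {
  await knex.schema.alterTable("users", (table) => {
    table.dropColumn("display_name");
  });
}
```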
Application configuration
Environment-specific configuration - database connection strings, API endpoints, feature flag
states, logging levels - should be defined as code and managed through version control.
What this looks like:
- Configuration values stored in a config management system (Consul, AWS Parameter Store,
environment variable definitions in infrastructure code)
- Configuration changes are committed, reviewed, and deployed through a pipeline
- The same application artifact is deployed to every environment; only the configuration differs
What this replaces:
- Configuration files edited manually on servers
- Environment variables set by hand and forgotten
- Configuration that exists only in a deployment runbook
See Application Config for detailed guidance on
externalizing configuration.
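As a sketch of what “only the configuration differs” can look like in application code, the following TypeScript reads environment-specific values at startup and fails fast when a required one is missing; the variable names are illustrative assumptions.

```typescript
// The artifact never embeds environment-specific values; it reads them at startup.
// The pipeline (or the platform) injects different values per environment.
interface AppConfig {
  databaseUrl: string;
  apiBaseUrl: string;
  logLevel: string;
}

function loadConfig(): AppConfig {
  const required = (name: string): string => {
    const value = process.env[name];
    if (!value) throw new Error(`Missing required config: ${name}`);
    return value;
  };
  return {
    databaseUrl: required("DATABASE_URL"),
    apiBaseUrl: required("API_BASE_URL"),
    logLevel: process.env.LOG_LEVEL ?? "info",
  };
}

export const config = loadConfig();
```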
Monitoring, alerting, and observability
Dashboards, alert rules, SLO definitions, and logging configuration should be defined as code.
What this looks like:
- Alert rules defined in Terraform, Prometheus rules files, or Datadog monitors-as-code
- Dashboards defined as JSON or YAML, not built by hand in a UI
- SLO definitions tracked in version control alongside the services they measure
- Logging configuration (what to log, where to send it, retention policies) in code
What this replaces:
- Dashboards built manually in a monitoring UI that nobody knows how to recreate
- Alert rules that were configured by hand during an incident and never documented
- Monitoring configuration that exists only on the monitoring server
Why it matters for CD: If you deploy ten times a day, you need to know instantly whether each
deployment is healthy. If your monitoring and alerting configuration is manual, it will drift,
break, or be incomplete. Monitoring-as-code ensures that every service has consistent, reviewed,
reproducible observability.
Security policies
Security controls - access policies, network rules, secret rotation schedules, compliance
checks - should be defined as code and enforced automatically.
What this looks like:
- IAM policies and RBAC rules defined in Terraform or policy-as-code tools (OPA, Sentinel)
- Security scanning integrated into the pipeline (SAST, dependency scanning, container image
scanning)
- Secret rotation automated and defined in code
- Compliance checks that run on every commit, not once a quarter
What this replaces:
- Security reviews that happen at the end of the development cycle
- Access policies configured through UIs and never audited
- Compliance as a manual checklist performed before each release
Why it matters for CD: Security and compliance requirements are the most common organizational
blockers for CD. When security controls are defined as code and enforced by the pipeline, you can
prove to auditors that every change passed security checks automatically. This is stronger
evidence than a manual review, and it does not slow down delivery.
The “One Change, One Process” Test
For every type of artifact in your system, ask:
If I need to change this, do I commit a code change and let the pipeline deliver it?
If the answer is yes, the artifact is managed as code. If the answer involves SSH, a UI, a
ticket to another team, or a manual step, it is not.
| Artifact | Managed as code? | If not, the risk is… |
| --- | --- | --- |
| Application source code | Usually yes | - |
| Infrastructure (servers, networks, cloud resources) | Often no | Snowflake environments, slow provisioning, unreproducible disasters |
| Pipeline definitions | Sometimes | Pipeline changes are slow, unreviewed, and risky |
| Database schemas | Sometimes | Schema changes require manual coordination and downtime |
| Application configuration | Sometimes | Config drift between environments, “works in staging” failures |
| Monitoring and alerting | Rarely | Monitoring gaps, unreproducible dashboards, alert fatigue |
| Security policies | Rarely | Security as a gate instead of a guardrail, audit failures |
The goal is for every row in this table to be “yes.” You will not get there overnight, but every
artifact you move from manual to code-managed removes a bottleneck and a risk.
How to Get There
Start with what blocks you most
Do not try to move everything to code at once. Identify the artifact type that causes the most
pain or blocks deployments most frequently:
- If environment provisioning takes days, start with infrastructure as code.
- If database changes are the reason you cannot deploy more than once a week, start with
schema migrations as code.
- If pipeline changes require tickets to a platform team, start with pipeline as code.
- If configuration drift causes production incidents, start with configuration as code.
Apply the same practices as application code
Once an artifact is defined as code, treat it with the same rigor as application code:
- Store it in version control (ideally in the same repository as the application it supports)
- Review changes before they are applied
- Test changes automatically (linting, dry-runs, policy checks)
- Deliver changes through a pipeline
- Never modify the artifact outside of this process
Eliminate manual pathways
The hardest part is closing the manual back doors. As long as someone can SSH into a server and
make a change, or click through a UI to modify infrastructure, the code-defined state will drift
from reality.
The principle is the same as Single Path to Production
for application code: the pipeline is the only way any change reaches production. This applies to
infrastructure, configuration, schemas, monitoring, and policies just as much as it applies to
application code.
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Artifact types managed as code | Track how many of the categories above are fully code-managed. The number should increase over time. |
| Manual changes to production | Count any change made outside of a pipeline (SSH, UI clicks, manual scripts). Target: zero. |
| Environment recreation time | How long does it take to recreate a production-like environment from scratch? Should decrease as more infrastructure moves to code. |
| Mean time to recovery | When infrastructure-as-code is in place, recovery from failures is “re-run the pipeline.” MTTR drops dramatically. |
3.3 - Phase 2: Pipeline
Build the automated path from commit to production: a single, deterministic pipeline that deploys immutable artifacts.
Key question: “Can we deploy any commit automatically?”
This phase creates the delivery pipeline - the automated path that takes every commit
through build, test, and deployment stages. When done right, the pipeline is the only
way changes reach production.
What You’ll Do
- Establish a single path to production - One pipeline for all changes
- Make the pipeline deterministic - Same inputs always produce same outputs
- Define “deployable” - Clear criteria for what’s ready to ship
- Use immutable artifacts - Build once, deploy everywhere
- Externalize application config - Separate config from code
- Use production-like environments - Test in environments that match production
- Design your pipeline architecture - Efficient quality gates for your context
- Enable rollback - Fast recovery from any deployment
Why This Phase Matters
The pipeline is the backbone of continuous delivery. It replaces manual handoffs with
automated quality gates, ensures every change goes through the same validation process,
and makes deployment a routine, low-risk event.
When You’re Ready to Move On
You’re ready for Phase 3: Optimize when:
- Every change reaches production through the same automated pipeline
- The pipeline produces the same result for the same inputs
- You can deploy any green build to production with confidence
- Rollback takes minutes, not hours
3.3.1 - Single Path to Production
All changes reach production through the same automated pipeline - no exceptions.
Phase 2 - Pipeline | Adapted from MinimumCD.org
Definition
A single path to production means that every change - whether it is a feature, a bug fix,
a configuration update, or an infrastructure change - follows the same automated pipeline
to reach production. There is exactly one route from a developer’s commit to a running
production system. No side doors. No emergency shortcuts. No “just this once” manual
deployments.
This is the most fundamental constraint of a continuous delivery pipeline. If you allow
multiple paths, you cannot reason about the state of production. You lose the ability to
guarantee that every change has been validated, and you undermine every other practice in
this phase.
Why It Matters for CD Migration
Teams migrating to continuous delivery often carry legacy deployment processes - a manual
runbook for “emergency” fixes, a separate path for database changes, or a distinct
workflow for infrastructure updates. Each additional path is a source of unvalidated risk.
Establishing a single path to production is the first pipeline practice because every
subsequent practice depends on it. A deterministic pipeline
only works if all changes flow through it. Immutable artifacts
are only trustworthy if no other mechanism can alter what reaches production. Your
deployable definition is meaningless if changes can bypass
the gates.
Key Principles
One pipeline for all changes
Every type of change uses the same pipeline:
- Application code - features, fixes, refactors
- Infrastructure as Code - Terraform, CloudFormation, Pulumi, Ansible
- Pipeline definitions - the pipeline itself is versioned and deployed through the pipeline
- Configuration changes - environment variables, feature flags, routing rules
- Database migrations - schema changes, data migrations
Same pipeline for all environments
The pipeline that deploys to development is the same pipeline that deploys to staging and
production. The only difference between environments is the configuration injected at
deployment time. If your staging deployment uses a different mechanism than your production
deployment, you are not testing the deployment process itself.
No manual deployments
If a human can bypass the pipeline and push a change directly to production, the single
path is broken. This includes:
- SSH access to production servers for ad-hoc changes
- Direct container image pushes outside the pipeline
- Console-based configuration changes that are not captured in version control
- “Break glass” procedures that skip validation stages
Anti-Patterns
Integration branches and multi-branch deployment paths
Using separate branches (such as develop, release, hotfix) that each have their own
deployment workflow creates multiple paths. GitFlow is a common source of this anti-pattern.
When a hotfix branch deploys through a different pipeline than the develop branch, you
cannot be confident that the hotfix has undergone the same validation.
Environment-specific pipelines
Building a separate pipeline for staging versus production - or worse, manually deploying
to staging and only using automation for production - means you are not testing your
deployment process in lower environments.
“Emergency” manual deployments
The most dangerous anti-pattern is the manual deployment reserved for emergencies. Under
pressure, teams bypass the pipeline “just this once,” introducing an unvalidated change
into production. The fix for this is not to allow exceptions - it is to make the pipeline
fast enough that it is always the fastest path to production.
Separate pipelines for different change types
Having one pipeline for application code, another for infrastructure, and yet another for
database changes means that coordinated changes across these layers are never validated
together.
Good Patterns
Feature flags
Use feature flags to decouple deployment from release. Code can be merged and deployed
through the pipeline while the feature remains hidden behind a flag. This eliminates the
need for long-lived branches and separate deployment paths for “not-ready” features.
Branch by abstraction
For large-scale refactors or technology migrations, use branch by abstraction to make
incremental changes that can be deployed through the standard pipeline at every step.
Create an abstraction layer, build the new implementation behind it, switch over
incrementally, and remove the old implementation - all through the same pipeline.
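Here is a minimal TypeScript sketch of the abstraction seam; the gateway interface, the two implementations, and the switching mechanism are hypothetical and deliberately simplified.

```typescript
// The abstraction seam: callers depend on the interface, not on either implementation.
interface PaymentGateway {
  charge(amountCents: number, customerId: string): Promise<string>;
}

class LegacyGateway implements PaymentGateway {
  async charge(amountCents: number, customerId: string): Promise<string> {
    // The existing implementation stays in place while the new one is built.
    return "legacy-receipt";
  }
}

class NewGateway implements PaymentGateway {
  async charge(amountCents: number, customerId: string): Promise<string> {
    // The new implementation grows behind the seam and ships dark on every commit.
    return "new-receipt";
  }
}

// Switch-over is incremental and reversible; when complete, delete LegacyGateway.
export function gateway(): PaymentGateway {
  return process.env.USE_NEW_GATEWAY === "true" ? new NewGateway() : new LegacyGateway();
}
```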
Dark launching
Deploy new functionality to production without exposing it to users. The code runs in
production, processes real data, and generates real metrics - but its output is not shown
to users. This validates the change under production conditions while managing risk.
Connect tests last
When building a new integration, start by deploying the code without connecting it to the
live dependency. Validate the deployment, the configuration, and the basic behavior first.
Connect to the real dependency as the final step. This keeps the change deployable through
the pipeline at every stage of development.
How to Get Started
Step 1: Map your current deployment paths
Document every way that changes currently reach production. Include manual processes,
scripts, CI/CD pipelines, direct deployments, and any emergency procedures. You will
likely find more paths than you expected.
Step 2: Identify the primary path
Choose or build one pipeline that will become the single path. This pipeline should be
the most automated and well-tested path you have. All other paths will converge into it.
Step 3: Eliminate the easiest alternate paths first
Start by removing the deployment paths that are used least frequently or are easiest to
replace. For each path you eliminate, migrate its changes into the primary pipeline.
Step 4: Make the pipeline fast enough for emergencies
The most common reason teams maintain manual deployment shortcuts is that the pipeline is
too slow for urgent fixes. If your pipeline takes 45 minutes and an incident requires a
fix in 10, the team will bypass the pipeline. Invest in pipeline speed so that the
automated path is always the fastest option.
Step 5: Remove break-glass access
Once the pipeline is fast and reliable, remove the ability to deploy outside of it.
Revoke direct production access. Disable manual deployment scripts. Make the pipeline the
only way.
Connection to the Pipeline Phase
Single path to production is the foundation of Phase 2. Without it, every other pipeline
practice is compromised:
- Deterministic pipeline requires all changes to flow through it to provide guarantees
- Deployable definition must be enforced by a single set of gates
- Immutable artifacts are only trustworthy when produced by a known, consistent process
- Rollback relies on the pipeline to deploy the previous version through the same path
Establishing this practice first creates the constraint that makes the rest of the
pipeline meaningful.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
3.3.2 - Deterministic Pipeline
The same inputs to the pipeline always produce the same outputs.
Phase 2 - Pipeline | Adapted from MinimumCD.org
Definition
A deterministic pipeline produces consistent, repeatable results. Given the same commit,
the same environment definition, and the same configuration, the pipeline will build the
same artifact, run the same tests, and produce the same outcome - every time. There is no
variance introduced by uncontrolled dependencies, environmental drift, manual
intervention, or non-deterministic test behavior.
Determinism is what transforms a pipeline from “a script that usually works” into a
reliable delivery system. When the pipeline is deterministic, a green build means
something. A failed build points to a real problem. Teams can trust the signal.
Why It Matters for CD Migration
Non-deterministic pipelines are the single largest source of wasted time in delivery
organizations. When builds fail randomly, teams learn to ignore failures. When the same
commit passes on retry, teams stop investigating root causes. When different environments
produce different results, teams lose confidence in pre-production validation.
During a CD migration, teams are building trust in automation. Every flaky test, every
“works on my machine” failure, and every environment-specific inconsistency erodes that
trust. A deterministic pipeline is what earns the team’s confidence that automation can
replace manual verification.
Key Principles
Version control everything
Every input to the pipeline must be version controlled:
- Application source code - the obvious one
- Infrastructure as Code - the environment definitions themselves
- Pipeline definitions - the CI/CD configuration files
- Test data and fixtures - the data used by automated tests
- Dependency lockfiles - exact versions of every dependency (e.g., package-lock.json, Pipfile.lock, go.sum)
- Tool versions - the versions of compilers, runtimes, linters, and build tools
If an input to the pipeline is not version controlled, it can change without notice, and
the pipeline is no longer deterministic.
Lock dependency versions
Floating dependency versions (version ranges, “latest” tags) are a common source of
non-determinism. A build that worked yesterday can break today because a transitive
dependency released a new version overnight.
Use lockfiles to pin exact versions of every dependency. Commit lockfiles to version
control. Update dependencies intentionally through pull requests, not implicitly through
builds.
Eliminate environmental variance
The pipeline should run in a controlled, reproducible environment. Containerize build
steps so that the build environment is defined in code and does not drift over time. Use
the same base images in CI as in production. Pin tool versions explicitly rather than
relying on whatever is installed on the build agent.
Remove human intervention
Any manual step in the pipeline is a source of variance. A human choosing which tests to
run, deciding whether to skip a stage, or manually approving a step introduces
non-determinism. The pipeline should run from commit to deployment without human
decisions.
This does not mean humans have no role - it means the pipeline’s behavior is fully
determined by its inputs, not by who is watching it run.
Fix flaky tests immediately
A flaky test is a test that sometimes passes and sometimes fails for the same code. Flaky
tests are the most insidious form of non-determinism because they train teams to distrust
the test suite.
When a flaky test is detected, the response must be immediate:
- Quarantine the test - remove it from the pipeline so it does not block other changes
- Fix it or delete it - flaky tests provide negative value; they are worse than no test
- Investigate the root cause - flakiness often indicates a real problem (race conditions, shared state, time dependencies, external service reliance)
Never allow a culture of “just re-run it” to take hold. Every re-run masks a real problem.
Anti-Patterns
Unpinned dependencies
Using version ranges like ^1.2.0 or >=2.0 in dependency declarations without a
lockfile means the build resolves different versions on different days. This applies to
application dependencies, build plugins, CI tool versions, and base container images.
Shared, mutable build environments
Build agents that accumulate state between builds (cached files, installed packages,
leftover containers) produce different results depending on what ran previously. Each
build should start from a clean, known state.
Tests that depend on external services
Tests that call live external APIs, depend on shared databases, or rely on network
resources introduce uncontrolled variance. External services change, experience outages,
and respond with different latency - all of which make the pipeline non-deterministic.
Time-dependent tests
Tests that depend on the current time, current date, or elapsed time are inherently
non-deterministic. A test that passes at 2:00 PM and fails at midnight is not testing
your application - it is testing the clock.
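One common fix is to inject the clock instead of reading it inside the logic, so the test passes identically at 2:00 PM and at midnight. A minimal TypeScript sketch, with illustrative function and test names:

```typescript
// Instead of calling Date.now() inside the logic, accept the current time as an input.
export function isExpired(expiresAt: Date, now: Date): boolean {
  return now.getTime() >= expiresAt.getTime();
}

// The test controls time explicitly, so it is deterministic.
// (Shown with plain assertions; adapt to your test runner.)
function testIsExpired(): void {
  const now = new Date("2024-01-01T00:00:00Z");
  const future = new Date("2024-01-02T00:00:00Z");
  console.assert(isExpired(future, now) === false, "not yet expired");
  console.assert(isExpired(now, now) === true, "expired at the boundary");
}
testIsExpired();
```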
Manual retry culture
Teams that routinely re-run failed pipelines without investigating the failure have
accepted non-determinism as normal. This is a cultural anti-pattern that must be
addressed alongside the technical ones.
Good Patterns
Containerized build environments
Define your build environment as a container image. Pin the base image version. Install
exact versions of all tools. Run every build in a fresh instance of this container. This
eliminates variance from the build environment.
Hermetic builds
A hermetic build is one that does not access the network during the build process. All
dependencies are pre-fetched and cached. The build can run identically on any machine, at
any time, with or without network access.
Contract tests for external dependencies
Replace live calls to external services with contract tests. These tests verify that your
code interacts correctly with an external service’s API contract without actually calling
the service. Combine with service virtualization or test doubles for integration tests.
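A minimal TypeScript sketch of the idea, using a hand-rolled test double; a dedicated contract-testing tool (such as Pact) adds provider-side verification on top of this, and the interface and payloads here are hypothetical.

```typescript
// The code under test depends on an interface, not on a live HTTP client.
interface InventoryApi {
  stockLevel(sku: string): Promise<number>;
}

export async function canFulfil(api: InventoryApi, sku: string, qty: number): Promise<boolean> {
  return (await api.stockLevel(sku)) >= qty;
}

// The test uses a double that encodes the agreed response shape for the contract.
// No network call, no shared environment, no variance between runs.
const fakeInventory: InventoryApi = {
  stockLevel: async (sku: string) => (sku === "SKU-1" ? 5 : 0),
};

canFulfil(fakeInventory, "SKU-1", 3).then((ok) =>
  console.assert(ok === true, "5 in stock covers an order of 3"),
);
```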
Deterministic test ordering
Run tests in a fixed, deterministic order - or better, ensure every test is independent
and can run in any order. Many test frameworks default to random ordering to detect
inter-test dependencies; use this during development but ensure no ordering dependencies
exist.
Immutable CI infrastructure
Treat CI build agents as cattle, not pets. Provision them from images. Replace them
rather than updating them. Never allow state to accumulate on a build agent between
pipeline runs.
How to Get Started
Step 1: Audit your pipeline inputs
List every input to your pipeline that is not version controlled. This includes
dependency versions, tool versions, environment configurations, test data, and pipeline
definitions themselves.
Step 2: Add lockfiles and pin versions
For every dependency manager in your project, ensure a lockfile is committed to version
control. Pin CI tool versions explicitly. Pin base image versions in Dockerfiles.
Step 3: Containerize the build
Move your build steps into containers with explicitly defined environments. This is often
the highest-leverage change for improving determinism.
Step 4: Identify and fix flaky tests
Review your test history for tests that have both passed and failed for the same commit.
Quarantine them immediately and fix or remove them within a defined time window (such as
one sprint).
Step 5: Monitor pipeline determinism
Track the rate of pipeline failures that are resolved by re-running without code changes.
This metric (sometimes called the “re-run rate”) directly measures non-determinism. Drive
it to zero.
Connection to the Pipeline Phase
Determinism is what gives the single path to production
its authority. If the pipeline produces inconsistent results, teams will work around it.
A deterministic pipeline is also the prerequisite for a meaningful
deployable definition - your quality gates are only as
reliable as the pipeline that enforces them.
When the pipeline is deterministic, immutable artifacts become
trustworthy: you know that the artifact was built by a consistent, repeatable process, and
its validation results are real.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
3.3.3 - Deployable Definition
Clear, automated criteria that determine when a change is ready for production.
Phase 2 - Pipeline | Adapted from MinimumCD.org
Definition
A deployable definition is the set of automated quality criteria that every artifact must
satisfy before it is considered ready for production. It is the pipeline’s answer to the
question: “How do we know this is safe to deploy?”
This is not a checklist that a human reviews. It is a set of automated gates - executable
validations built into the pipeline - that every change must pass. If the pipeline is
green, the artifact is deployable. If the pipeline is red, it is not. There is no
ambiguity, no judgment call, and no “looks good enough.”
Why It Matters for CD Migration
Without a clear, automated deployable definition, teams rely on human judgment to decide
when something is ready to ship. This creates bottlenecks (waiting for approval), variance
(different people apply different standards), and fear (nobody is confident the change is
safe). All three are enemies of continuous delivery.
During a CD migration, the deployable definition replaces manual approval processes with
automated confidence. It is what allows a team to say “any green build can go to
production” - which is the prerequisite for continuous deployment.
Key Principles
The definition must be automated
Every criterion in the deployable definition is enforced by an automated check in the
pipeline. If a requirement cannot be automated, either find a way to automate it or
question whether it belongs in the deployment path.
The definition must be comprehensive
The deployable definition should cover all dimensions of quality that matter for
production readiness:
Security
- Static Application Security Testing (SAST) - scan source code for known vulnerability patterns
- Dependency vulnerability scanning - check all dependencies against known vulnerability databases (CVE lists)
- Secret detection - verify that no credentials, API keys, or tokens are present in the codebase
- Container image scanning - if deploying containers, scan images for known vulnerabilities
- License compliance - verify that dependency licenses are compatible with your distribution requirements
Functionality
- Unit tests - fast, isolated tests that verify individual components behave correctly
- Integration tests - tests that verify components work together correctly
- End-to-end tests - tests that verify the system works from the user’s perspective
- Regression tests - tests that verify previously fixed defects have not reappeared
- Contract tests - tests that verify APIs conform to their published contracts
Compliance
- Audit trail - the pipeline itself produces the compliance artifact: who changed what, when, and what validations it passed
- Policy as code - organizational policies (e.g., “no deployments on Friday”) encoded as pipeline logic
- Change documentation - automatically generated from commit metadata and pipeline results
Performance
- Performance benchmarks - verify that key operations complete within acceptable thresholds
- Load test baselines - verify that the system handles expected load without degradation
- Resource utilization checks - verify that the change does not introduce memory leaks or excessive CPU usage
Reliability
- Health check validation - verify that the application starts up correctly and responds to health checks (see the sketch after this list)
- Graceful degradation tests - verify that the system behaves acceptably when dependencies fail
- Rollback verification - verify that the deployment can be rolled back (see Rollback)
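A minimal sketch of the health endpoint referenced above, using Express as an assumed HTTP framework; the route, port, and response fields are illustrative.

```typescript
import express from "express";

const app = express();

// Liveness: the process is up and serving requests.
// A readiness check could additionally verify dependencies (database, cache, queues).
app.get("/healthz", (_req, res) => {
  res.status(200).json({ status: "ok", version: process.env.BUILD_VERSION ?? "unknown" });
});

app.listen(8080, () => console.log("listening on 8080"));
```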
Code Quality
- Linting and static analysis - enforce code style and detect common errors
- Code coverage thresholds - not as a target, but as a safety net to detect large untested areas
- Complexity metrics - flag code that exceeds complexity thresholds for review
The definition must be fast
A deployable definition that takes hours to evaluate will not support continuous delivery.
The entire pipeline - including all deployable definition checks - should complete in
minutes, not hours. This often requires running checks in parallel, investing in test
infrastructure, and making hard choices about which slow checks provide enough value to
keep.
The definition must be maintained
The deployable definition is a living document. As the system evolves, new failure modes
emerge, and the definition should be updated to catch them. When a production incident
occurs, the team should ask: “What automated check could have caught this?” and add it to
the definition.
Anti-Patterns
Manual approval gates
Requiring a human to review and approve a deployment after the pipeline has passed all
automated checks is an anti-pattern. It adds latency, creates bottlenecks, and implies
that the automated checks are not sufficient. If a human must approve, it means your
automated definition is incomplete - fix the definition rather than adding a manual gate.
“Good enough” tolerance
Allowing deployments when some checks fail because “that test always fails” or “it is
only a warning” degrades the deployable definition to meaninglessness. Either the check
matters and must pass, or it does not matter and should be removed.
Post-deployment validation only
Running validation only after deployment to production (production smoke tests, manual
QA in production) means you are using production users to find problems. Pre-deployment
validation must be comprehensive enough that post-deployment checks are a safety net, not
the primary quality gate.
Inconsistent definitions across teams
When different teams have different deployable definitions, organizational confidence
in deployment varies. While the specific checks may differ by service, the categories of
validation (security, functionality, performance, compliance) should be consistent.
Good Patterns
Pipeline gates as policy
Encode the deployable definition as pipeline stages that block progression. A change
cannot move from build to test, or from test to deployment, unless the preceding stage
passes completely. The pipeline enforces the definition; no human override is possible.
Shift-left validation
Run the fastest, most frequently failing checks first. Unit tests and linting run before
integration tests. Integration tests run before end-to-end tests. Security scans run in
parallel with test stages. This gives developers the fastest possible feedback.
Continuous definition improvement
After every production incident, add or improve a check in the deployable definition that
would have caught the issue. Over time, the definition becomes a comprehensive record of
everything the team has learned about quality.
Visible, shared definitions
Make the deployable definition visible to all team members. Display the current pipeline
status on dashboards. When a check fails, provide clear, actionable feedback about what
failed and why. The definition should be understood by everyone, not hidden in pipeline
configuration.
How to Get Started
Step 1: Document your current “definition of done”
Write down every check that currently happens before a deployment - automated or manual.
Include formal checks (tests, scans) and informal ones (someone eyeballs the logs,
someone clicks through the UI).
Step 2: Classify each check
For each check, determine: Is it automated? Is it fast? Is it reliable? Is it actually
catching real problems? This reveals which checks are already pipeline-ready and which
need work.
Step 3: Automate the manual checks
For every manual check, determine how to automate it. A human clicking through the UI
becomes an end-to-end test. A human reviewing logs becomes an automated log analysis step.
A manager approving a deployment becomes a set of automated policy checks.
Step 4: Build the pipeline gates
Organize your automated checks into pipeline stages. Fast checks first, slower checks
later. All checks must pass for the artifact to be considered deployable.
Step 5: Remove manual approvals
Once the automated definition is comprehensive enough that a green build genuinely means
“safe to deploy,” remove manual approval gates. This is often the most culturally
challenging step.
Connection to the Pipeline Phase
The deployable definition is the contract between the pipeline and the organization. It is
what makes the single path to production trustworthy -
because every change that passes through the path has been validated against a clear,
comprehensive standard.
Combined with a deterministic pipeline, the deployable
definition ensures that green means green and red means red. Combined with
immutable artifacts, it ensures that the artifact you validated
is the artifact you deploy. It is the bridge between automated process and organizational
confidence.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
3.3.4 - Immutable Artifacts
Build once, deploy everywhere. The same artifact is used in every environment.
Phase 2 - Pipeline | Adapted from MinimumCD.org
Definition
An immutable artifact is a build output that is created exactly once and deployed to every
environment without modification. The binary, container image, or package that runs in
production is byte-for-byte identical to the one that passed through testing. Nothing is
recompiled, repackaged, or altered between environments.
“Build once, deploy everywhere” is the core principle. The artifact is sealed at build
time. Configuration is injected at deployment time (see
Application Configuration), but the artifact itself never
changes.
Why It Matters for CD Migration
If you build a separate artifact for each environment - or worse, make manual adjustments
to artifacts at deployment time - you can never be certain that what you tested is what
you deployed. Every rebuild introduces the possibility of variance: a different dependency
resolved, a different compiler flag applied, a different snapshot of the source.
Immutable artifacts eliminate an entire class of “works in staging, fails in production”
problems. They provide confidence that the pipeline results are real: the artifact that
passed every quality gate is the exact artifact running in production.
For teams migrating to CD, this practice is a concrete, mechanical step that delivers
immediate trust. Once the team sees that the same container image flows from CI to
staging to production, the deployment process becomes verifiable instead of hopeful.
Key Principles
Build once
The artifact is produced exactly once, during the build stage of the pipeline. It is
stored in an artifact repository (such as a container registry, Maven repository, npm
registry, or object store) and every subsequent stage of the pipeline - and every
environment - pulls and deploys that same artifact.
No manual adjustments
Artifacts are never modified after creation. This means:
- No recompilation for different environments
- No patching binaries in staging to fix a test failure
- No adding environment-specific files into a container image after the build
- No editing properties files inside a deployed artifact
Version everything that goes into the build
Because the artifact is built once and cannot be changed, every input must be correct at
build time:
- Source code - committed to version control at a specific commit hash
- Dependencies - locked to exact versions via lockfiles
- Build tools - pinned to specific versions
- Build configuration - stored in version control alongside the source
Tag and trace
Every artifact must be traceable back to the exact commit, pipeline run, and set of inputs
that produced it. Use content-addressable identifiers (such as container image digests),
semantic version tags, or build metadata that links the artifact to its source.
Anti-Patterns
Rebuilding per environment
Building the artifact separately for development, staging, and production - even from the
same source - means each artifact is a different build. Different builds can produce
different results due to non-deterministic build processes, updated dependencies, or
changed build environments.
SNAPSHOT or mutable versions
Using version identifiers like -SNAPSHOT (Maven), latest (container images), or
unversioned “current” references means the same version label can point to different
artifacts at different times. This makes it impossible to know exactly what is deployed.
Manual intervention at failure points
When a deployment fails, the fix must go through the pipeline. Manually patching the
artifact, restarting with modified configuration, or applying a hotfix directly to the
running system breaks immutability and bypasses the quality gates.
Environment-specific builds
Build scripts that use conditionals like “if production, include X” create
environment-coupled artifacts. The artifact should be environment-agnostic;
environment configuration handles the differences.
Artifacts that self-modify
Applications that write to their own deployment directory, modify their own configuration
files at runtime, or store state alongside the application binary are not truly immutable.
Runtime state must be stored externally.
Good Patterns
Container images as immutable artifacts
Container images are an excellent vehicle for immutable artifacts. A container image built
in CI, pushed to a registry with a content-addressable digest, and pulled into each
environment is inherently immutable. The image that ran in staging is provably identical
to the image running in production.
Instead of rebuilding for each environment, promote the same artifact through environments.
Artifact promotion
The pipeline builds the artifact once, deploys it to a test environment, validates it,
then promotes it (deploys the same artifact) to staging, then production. The artifact
never changes; only the environment it runs in changes.
Content-addressable storage
Use content-addressable identifiers (SHA-256 digests, content hashes) rather than mutable
tags as the primary artifact reference. A content-addressed artifact is immutable by
definition: changing any byte changes the address.
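A minimal sketch of producing such a reference for a build output; the artifact path is hypothetical, and container registries compute the equivalent digest for you when an image is pushed.

```python
import hashlib
from pathlib import Path

def artifact_digest(path: str) -> str:
    """Return a SHA-256 digest that uniquely identifies the artifact's content."""
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
    return f"sha256:{sha.hexdigest()}"

# Record the digest next to the artifact so every environment is deployed -
# and later verified - against the same identifier. The path is hypothetical.
artifact = "build/my-service.tar.gz"
digest = artifact_digest(artifact)
Path(artifact + ".digest").write_text(digest + "\n")
print(digest)
```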
Signed artifacts
Digitally sign artifacts at build time and verify the signature before deployment. This
guarantees that the artifact has not been tampered with between the build and the
deployment. This is especially important for supply chain security.
Reproducible builds
Strive for builds where the same source input produces a bit-for-bit identical artifact.
While not always achievable (timestamps, non-deterministic linkers), getting close makes
it possible to verify that an artifact was produced from its claimed source.
How to Get Started
Step 1: Separate build from deployment
If your pipeline currently rebuilds for each environment, restructure it into two
distinct phases: a build phase that produces a single artifact, and a deployment phase that
takes that artifact and deploys it to a target environment with the appropriate
configuration.
Step 2: Set up an artifact repository
Choose an artifact repository appropriate for your technology stack - a container registry
for container images, a package registry for libraries, or an object store for compiled
binaries. All downstream pipeline stages pull from this repository.
Step 3: Eliminate mutable version references
Replace latest tags, -SNAPSHOT versions, and any other mutable version identifier
with immutable references. Use commit-hash-based tags, semantic versions, or
content-addressable digests.
Step 4: Promote the same artifact through environments
Modify your pipeline to deploy the same artifact to each environment in sequence. The
pipeline should pull the artifact from the repository by its immutable identifier and
deploy it without modification.
Step 5: Add traceability
Ensure every deployed artifact can be traced back to its source commit, build log, and
pipeline run. Label container images with build metadata. Store build provenance alongside
the artifact in the repository.
Step 6: Verify immutability
Periodically verify that what is running in production matches what the pipeline built.
Compare image digests, checksums, or signatures. This catches any manual modifications
that may have bypassed the pipeline.
Connection to the Pipeline Phase
Immutable artifacts are the physical manifestation of trust in the pipeline. The
single path to production ensures all changes flow
through the pipeline. The deterministic pipeline ensures the
build is repeatable. The deployable definition ensures the
artifact meets quality criteria. Immutability ensures that the validated artifact - and
only that artifact - reaches production.
This practice also directly supports rollback: because previous artifacts
are stored unchanged in the artifact repository, rolling back is simply deploying a
previous known-good artifact.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
3.3.5 - Application Configuration
Separate configuration from code so the same artifact works in every environment.
Phase 2 - Pipeline | Adapted from MinimumCD.org
Definition
Application configuration is the practice of correctly separating what varies between
environments from what does not, so that a single immutable artifact
can run in any environment. This distinction - drawn from the
Twelve-Factor App methodology - is essential for
continuous delivery.
There are two distinct types of configuration:
- Application config - settings that define how the application behaves, are the same
  in every environment, and should be bundled with the artifact. Examples: routing rules,
  feature flag defaults, serialization formats, timeout policies, retry strategies.
- Environment config - settings that vary by deployment target and must be injected at
  deployment time. Examples: database connection strings, API endpoint URLs, credentials,
  resource limits, logging levels for that environment.
Getting this distinction right is critical. Bundling environment config into the artifact
breaks immutability. Externalizing application config that does not vary creates
unnecessary complexity and fragility.
Why It Matters for CD Migration
Configuration is where many CD migrations stall. Teams that have been deploying manually
often have configuration tangled with code - hardcoded URLs, environment-specific build
profiles, configuration files that are manually edited during deployment. Untangling this
is a prerequisite for immutable artifacts and automated deployments.
When configuration is handled correctly, the same artifact flows through every environment
without modification, environment-specific values are injected at deployment time, and
feature behavior can be changed without redeploying. This enables the deployment speed and
safety that continuous delivery requires.
Key Principles
Bundle what does not vary
Application configuration that is identical across all environments belongs inside the
artifact. This includes:
- Default feature flag values - the static, compile-time defaults for feature flags
- Application routing and mapping rules - URL patterns, API route definitions
- Serialization and encoding settings - JSON configuration, character encoding
- Internal timeout and retry policies - backoff strategies, circuit breaker thresholds
- Validation rules - input validation constraints, business rule parameters
These values are part of the application’s behavior definition. They should be version
controlled with the source code and deployed as part of the artifact.
Externalize what varies
Environment configuration that changes between deployment targets must be injected at
deployment time:
- Database connection strings - different databases for test, staging, production
- External service URLs - different endpoints for downstream dependencies
- Credentials and secrets - always injected, never bundled, never in version control
- Resource limits - memory, CPU, connection pool sizes tuned per environment
- Environment-specific logging levels - verbose in development, structured in production
- Feature flag overrides - dynamic flag values managed by an external flag service
Feature flags: static vs. dynamic
Feature flags deserve special attention because they span both categories:
- Static feature flags - compiled into the artifact as default values. They define the
  initial state of a feature when the application starts. Changing them requires a new
  build and deployment.
- Dynamic feature flags - read from an external service at runtime. They can be
  toggled without deploying. Use these for operational toggles (kill switches, gradual
  rollouts) and experiment flags (A/B tests).
A well-designed feature flag system uses static defaults (bundled in the artifact) that can
be overridden by a dynamic source (external flag service). If the flag service is
unavailable, the application falls back to its static defaults - a safe, predictable
behavior.
Anti-Patterns
Hardcoded environment-specific values
Database URLs, API endpoints, or credentials embedded directly in source code or
configuration files that are baked into the artifact. This forces a different build per
environment and makes secrets visible in version control.
Externalizing everything
Moving all configuration to an external service - including values that never change
between environments - creates unnecessary runtime dependencies. If the configuration
service is down and a value that is identical in every environment cannot be read, the
application fails to start for no good reason.
Environment-specific build profiles
Build systems that use profiles like mvn package -P production or Webpack configurations
that toggle behavior based on NODE_ENV at build time create environment-coupled
artifacts. The artifact must be the same regardless of where it will run.
Configuration files edited during deployment
Manually editing application.properties, .env files, or YAML configurations on the
server during or after deployment is error-prone, unrepeatable, and invisible to the
pipeline. All configuration injection must be automated.
Secrets in version control
Credentials, API keys, certificates, and tokens must never be stored in version control -
not even in “private” repositories, not even encrypted with simple mechanisms. Use a
secrets manager (Vault, AWS Secrets Manager, Azure Key Vault) and inject secrets at
deployment time.
Good Patterns
Environment variables for environment config
Following the Twelve-Factor App approach, inject environment-specific values as
environment variables. This is universally supported across languages and platforms, works
with containers and orchestrators, and keeps the artifact clean.
Layered configuration
Use a configuration framework that supports layering:
- Defaults - bundled in the artifact (application config)
- Environment overrides - injected via environment variables or mounted config files
- Dynamic overrides - read from a feature flag service or configuration service at runtime
Each layer overrides the previous one. The application always has a working default, and
environment-specific or dynamic values override only what needs to change.
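A minimal sketch of this layering in application code, assuming a hypothetical mounted config file path and environment variable names; most configuration frameworks provide the same behavior out of the box.

```python
import json
import os
from pathlib import Path

# Layer 1: application config bundled with the artifact (identical everywhere).
DEFAULTS = {
    "request_timeout_seconds": 30,
    "retry_attempts": 3,
    "log_level": "INFO",
}

def load_config() -> dict:
    config = dict(DEFAULTS)

    # Layer 2: environment overrides from a mounted config file, if present.
    override_file = Path(os.environ.get("CONFIG_FILE", "/etc/app/config.json"))
    if override_file.exists():
        config.update(json.loads(override_file.read_text()))

    # Layer 3: individual environment variables win over everything else.
    if "LOG_LEVEL" in os.environ:
        config["log_level"] = os.environ["LOG_LEVEL"]
    if "REQUEST_TIMEOUT_SECONDS" in os.environ:
        config["request_timeout_seconds"] = int(os.environ["REQUEST_TIMEOUT_SECONDS"])

    return config

print(load_config())
```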
Config maps and secrets in orchestrators
Kubernetes ConfigMaps and Secrets (or equivalent mechanisms in other orchestrators)
provide a clean separation between the artifact (the container image) and the
environment-specific configuration. The image is immutable; the configuration is injected
at pod startup.
Secrets management with rotation
Use a dedicated secrets manager that supports automatic rotation, audit logging, and
fine-grained access control. The application retrieves secrets at startup or on-demand,
and the secrets manager handles rotation without requiring redeployment.
Configuration validation at startup
The application should validate its configuration at startup and fail fast with a clear
error message if required configuration is missing or invalid. This catches configuration
errors immediately rather than allowing the application to start in a broken state.
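A minimal fail-fast sketch run at application startup; the required variable names are hypothetical.

```python
import os
import sys

REQUIRED = ["DATABASE_URL", "PAYMENTS_API_URL"]  # hypothetical variable names

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    # Fail fast with an actionable message instead of starting in a broken state.
    print(f"FATAL: missing required configuration: {', '.join(missing)}", file=sys.stderr)
    sys.exit(1)
```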
How to Get Started
Step 1: Inventory your configuration
List every configuration value your application uses. For each one, determine: Does this
value change between environments? If yes, it is environment config. If no, it is
application config.
Step 2: Move environment config out of the artifact
For every environment-specific value currently bundled in the artifact (hardcoded URLs,
build profiles, environment-specific property files), extract it and inject it via
environment variable, config map, or secrets manager.
Step 3: Bundle application config with the code
For every value that does not vary between environments, ensure it is committed to version
control alongside the source code and included in the artifact at build time. Remove it
from any external configuration system where it adds unnecessary complexity.
Step 4: Implement feature flags properly
Set up a feature flag framework with static defaults in the code and an external flag
service for dynamic overrides. Ensure the application degrades gracefully if the flag
service is unavailable.
Step 5: Remove environment-specific build profiles
Eliminate any build-time branching based on target environment. The build produces one
artifact. Period.
Step 6: Automate configuration injection
Ensure that configuration injection is fully automated in the deployment pipeline. No
human should manually set environment variables or edit configuration files during
deployment.
Connection to the Pipeline Phase
Application configuration is the enabler that makes
immutable artifacts practical. An artifact can only be truly
immutable if it does not contain environment-specific values that would need to change
between deployments.
Correct configuration separation also supports
production-like environments - because the same
artifact runs everywhere, the only difference between environments is the injected
configuration, which is itself version controlled and automated.
When configuration is externalized correctly, rollback becomes
straightforward: deploy the previous artifact with the appropriate configuration, and the
system returns to its prior state.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
3.3.6 - Production-Like Environments
Test in environments that match production to catch environment-specific issues early.
Phase 2 - Pipeline | Adapted from MinimumCD.org
Definition
Production-like environments are pre-production environments that mirror the
infrastructure, configuration, and behavior of production closely enough that passing
tests in these environments provides genuine confidence that the change will work in
production.
“Production-like” does not mean “identical to production” in every dimension. It means
that the aspects of the environment relevant to the tests being run match production
sufficiently to produce a valid signal. A unit test environment needs the right runtime
version. An integration test environment needs the right service topology. A staging
environment needs the right infrastructure, networking, and data characteristics.
Why It Matters for CD Migration
The gap between pre-production environments and production is where deployment failures
hide. Teams that test in environments that differ significantly from production - in
operating system, database version, network topology, resource constraints, or
configuration - routinely discover issues only after deployment.
For a CD migration, production-like environments are what transform pre-production testing
from “we hope this works” to “we know this works.” They close the gap between the
pipeline’s quality signal and the reality of production, making it safe to deploy
automatically.
Key Principles
Staging reflects production infrastructure
Your staging environment should match production in the dimensions that affect application
behavior:
- Infrastructure platform - same cloud provider, same orchestrator, same service mesh
- Network topology - same load balancer configuration, same DNS resolution patterns,
same firewall rules
- Database engine and version - same database type, same version, same configuration
parameters
- Operating system and runtime - same OS distribution, same runtime version, same
system libraries
- Service dependencies - same versions of downstream services, or accurate test doubles
Staging does not necessarily need the same scale as production (fewer replicas, smaller
instances), but the architecture must be the same.
Environments are version controlled
Every aspect of the environment that can be defined in code must be:
- Infrastructure definitions - Terraform, CloudFormation, Pulumi, or equivalent
- Configuration - Kubernetes manifests, Helm charts, Ansible playbooks
- Network policies - security groups, firewall rules, service mesh configuration
- Monitoring and alerting - the same observability configuration in all environments
Version-controlled environments can be reproduced, compared, and audited. Manual
environment configuration cannot.
Ephemeral environments
Ephemeral environments are full-stack, on-demand, short-lived environments spun up for a
specific purpose - a pull request, a test run, a demo - and destroyed when that purpose is
complete.
Key characteristics of ephemeral environments:
- Full-stack - they include the application and all of its dependencies (databases,
message queues, caches, downstream services), not just the application in isolation
- On-demand - any developer or pipeline can spin one up at any time without waiting
for a shared resource
- Short-lived - they exist for hours or days, not weeks or months. This prevents
configuration drift and stale state
- Version controlled - the environment definition is in code, and the environment is
created from a specific version of that code
- Isolated - they do not share resources with other environments. No shared databases,
no shared queues, no shared service instances
Ephemeral environments eliminate the “shared staging” bottleneck where multiple teams
compete for a single pre-production environment and block each other’s progress.
Data is representative
The data in pre-production environments must be representative of production data in
structure, volume, and characteristics. This does not mean using production data directly
(which raises security and privacy concerns). It means:
- Schema matches production - same tables, same columns, same constraints
- Volume is realistic - tests run against data sets large enough to reveal performance
issues
- Data characteristics are representative - edge cases, special characters,
null values, and data distributions that match what the application will encounter
- Data is anonymized - if production data is used as a seed, all personally
identifiable information is removed or masked
Anti-Patterns
Shared, long-lived staging environments
A single staging environment shared by multiple teams becomes a bottleneck and a source of
conflicts. Teams overwrite each other’s changes, queue up for access, and encounter
failures caused by other teams’ work. Long-lived environments also drift from production
as manual changes accumulate.
Environments that differ from production in critical ways
Running a different database version in staging than production, using a different
operating system, or skipping the load balancer that exists in production creates blind
spots where issues hide until they reach production.
“It works on my laptop” as validation
Developer laptops are the least production-like environment available. They have different
operating systems, different resource constraints, different network characteristics, and
different installed software. Local validation is valuable for fast feedback during
development, but it does not replace testing in a production-like environment.
Manual environment provisioning
Environments created by manually clicking through cloud consoles, running ad-hoc scripts,
or following runbooks are unreproducible and drift over time. If you cannot destroy and
recreate the environment from code in minutes, it is not suitable for continuous delivery.
Synthetic-only test data
Using only hand-crafted test data with a few happy-path records misses the issues that
emerge with production-scale data: slow queries, missing indexes, encoding problems, and
edge cases that only appear in real-world data distributions.
Good Patterns
Infrastructure as Code for all environments
Define every environment - from local development to production - using the same
Infrastructure as Code templates. The differences between environments are captured in
configuration variables (instance sizes, replica counts, domain names), not in different
templates.
Environment-per-pull-request
Automatically provision a full-stack ephemeral environment for every pull request. Run the
full test suite against this environment. Tear it down when the pull request is merged or
closed. This provides isolated, production-like validation for every change.
Production data sampling and anonymization
Build an automated pipeline that samples production data, anonymizes it (removing PII,
masking sensitive fields), and loads it into pre-production environments. This provides
realistic data without security or privacy risks.
Service virtualization for external dependencies
For external dependencies that cannot be replicated in pre-production (third-party APIs,
partner systems), use service virtualization to create realistic test doubles that mimic
the behavior, latency, and error modes of the real service.
Environment parity monitoring
Continuously compare pre-production environments against production to detect drift.
Alert when the infrastructure, configuration, or service versions diverge. Tools that
compare Terraform state, Kubernetes configurations, or cloud resource inventories can
automate this comparison.
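A sketch of the comparison step itself, assuming you can already export a service-to-version inventory from each environment with your own tooling; the example inventories are invented.

```python
def parity_drift(staging: dict[str, str], production: dict[str, str]) -> list[str]:
    """List services whose versions differ between the two environments."""
    all_services = sorted(set(staging) | set(production))
    return [
        f"{svc}: staging={staging.get(svc)} production={production.get(svc)}"
        for svc in all_services
        if staging.get(svc) != production.get(svc)
    ]

# Example inventories (normally exported from your deployment tooling).
print(parity_drift(
    {"orders": "1.4.2", "payments": "2.0.1"},
    {"orders": "1.4.2", "payments": "2.1.0"},
))
```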
Namespaced environments in shared clusters
In Kubernetes or similar platforms, use namespaces to create isolated environments within
a shared cluster. Each namespace gets its own set of services, databases, and
configuration, providing isolation without the cost of separate clusters.
How to Get Started
Step 1: Audit environment parity
Compare your current pre-production environments against production across every relevant
dimension: infrastructure, configuration, data, service versions, network topology. List
every difference.
Step 2: Infrastructure-as-Code your environments
If your environments are not yet defined in code, start here. Define your production
environment in Terraform, CloudFormation, or equivalent. Then create pre-production
environments from the same definitions with different parameter values.
Step 3: Address the highest-risk parity gaps
From your audit, identify the differences most likely to cause production failures -
typically database version mismatches, missing infrastructure components, or network
configuration differences. Fix these first.
Step 4: Implement ephemeral environments
Build the tooling to spin up and tear down full-stack environments on demand. Start with
a simplified version (perhaps without full data replication) and iterate toward full
production parity.
Step 5: Automate data provisioning
Create an automated pipeline for generating or sampling representative test data. Include
anonymization, schema validation, and data refresh on a regular schedule.
Step 6: Monitor and maintain parity
Set up automated checks that compare pre-production environments to production and alert
on drift. Make parity a continuous concern, not a one-time setup.
Connection to the Pipeline Phase
Production-like environments are where the pipeline’s quality gates run. Without
production-like environments, the deployable definition
produces a false signal - tests pass in an environment that does not resemble production,
and failures appear only after deployment.
Immutable artifacts flow through these environments unchanged,
with only configuration varying. This combination - same
artifact, production-like environment, environment-specific configuration - is what gives
the pipeline its predictive power.
Production-like environments also support effective rollback testing: you
can validate that a rollback works correctly in a staging environment before relying on it
in production.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
3.3.7 - Pipeline Architecture
Design efficient quality gates for your delivery system’s context.
Phase 2 - Pipeline | Adapted from Dojo Consortium
Definition
Pipeline architecture is the structural design of your delivery pipeline - how stages are
organized, how quality gates are sequenced, how feedback loops operate, and how the
pipeline evolves over time. It encompasses both the technical design of the pipeline and
the improvement journey that a team follows from an initial, fragile pipeline to a mature,
resilient delivery system.
Good pipeline architecture is not achieved in a single step. Teams progress through
recognizable states, applying the Theory of Constraints to systematically identify and
resolve bottlenecks. The goal is a loosely coupled architecture where independent services
can be built, tested, and deployed independently through their own pipelines.
Why It Matters for CD Migration
Most teams beginning a CD migration have a pipeline that is somewhere between “barely
functional” and “works most of the time.” The pipeline may be slow, fragile, or tightly
coupled to other systems. Improving it requires a deliberate architectural approach - not
just adding more stages or more tests, but designing the pipeline for the flow
characteristics that continuous delivery demands.
Understanding where your pipeline architecture currently stands, and what the next
improvement looks like, prevents teams from either stalling at a “good enough” state or
attempting to jump directly to a target state that their context cannot support.
Three Architecture States
Teams typically progress through three recognizable states on their journey to mature
pipeline architecture. Understanding which state you are in determines what improvements
to prioritize.
Entangled (Starting State)
In the entangled state, the pipeline has significant structural problems that prevent
reliable delivery:
- Multiple applications share a single pipeline - a change to one application triggers
builds and tests for all applications, causing unnecessary delays and false failures
- Shared, mutable infrastructure - pipeline stages depend on shared databases, shared
environments, or shared services that introduce coupling and contention
- Manual stages interrupt automated flow - manual approval gates, manual test
execution, or manual environment provisioning block the pipeline for hours or days
- No clear ownership - the pipeline is maintained by a central team, and application
teams cannot modify it without filing tickets and waiting
- Build times measured in hours - the pipeline is so slow that developers batch
changes and avoid running it
- Flaky tests are accepted - the team routinely re-runs failed pipelines, and failures
are assumed to be transient
Remediation priorities:
- Separate pipelines for separate applications
- Remove manual stages or parallelize them out of the critical path
- Fix or remove flaky tests
- Establish clear pipeline ownership with the application team
Tightly Coupled (Transitional)
In the tightly coupled state, each application has its own pipeline, but pipelines depend
on each other or on shared resources:
- Integration tests span multiple services - a pipeline for service A runs integration
tests that require service B, C, and D to be deployed in a specific state
- Shared test environments - multiple pipelines deploy to the same staging environment,
creating contention and sequencing constraints
- Coordinated deployments - deploying service A requires simultaneously deploying
service B, which requires coordinating two pipelines
- Shared build infrastructure - pipelines compete for limited build agent capacity,
causing queuing delays
- Pipeline definitions are centralized - a shared pipeline library controls the
structure, and application teams cannot customize it for their needs
Improvement priorities:
- Replace cross-service integration tests with contract tests
- Implement ephemeral environments to eliminate shared environment contention
- Decouple service deployments using backward-compatible changes and feature flags
- Give teams ownership of their pipeline definitions
- Scale build infrastructure to eliminate queuing
Loosely Coupled (Goal)
In the loosely coupled state, each service has an independent pipeline that can build,
test, and deploy without depending on other services’ pipelines:
- Independent deployability - any service can be deployed at any time without
coordinating with other teams
- Contract-based integration - services verify their interactions through contract
tests, not cross-service integration tests
- Ephemeral, isolated environments - each pipeline creates its own test environment
and tears it down when done
- Team-owned pipelines - each team controls their pipeline definition and can optimize
it for their service’s needs
- Fast feedback - the pipeline completes in minutes, providing rapid feedback to
developers
- Self-service infrastructure - teams provision their own pipeline infrastructure
without waiting for a central team
Applying the Theory of Constraints
Pipeline improvement follows the Theory of Constraints: identify the single biggest
bottleneck, resolve it, and repeat. The key steps:
Step 1: Identify the constraint
Measure where time is spent in the pipeline. Common constraints include:
- Slow test suites - tests that take 30+ minutes dominate the pipeline duration
- Queuing for shared resources - pipelines waiting for build agents, shared
environments, or manual approvals
- Flaky failures and re-runs - time lost to investigating and re-running non-deterministic
failures
- Large batch sizes - pipelines triggered by large, infrequent commits that take
longer to build and are harder to debug when they fail
Step 2: Exploit the constraint
Get the maximum throughput from the current constraint without changing the architecture:
- Parallelize test execution across multiple agents
- Cache dependencies to speed up the build stage
- Prioritize pipeline runs (trunk commits before branch builds)
- Deduplicate unnecessary work (skip unchanged modules)
Step 3: Subordinate everything else to the constraint
Ensure that other parts of the system do not overwhelm the constraint:
- If the test stage is the bottleneck, do not add more tests without first making
existing tests faster
- If the build stage is the bottleneck, do not add more build steps without first
optimizing the build
Step 4: Elevate the constraint
If exploiting the constraint is not sufficient, invest in removing it:
- Rewrite slow tests to be faster
- Replace shared environments with ephemeral environments
- Replace manual gates with automated checks
- Split monolithic pipelines into independent service pipelines
Step 5: Repeat
Once a constraint is resolved, a new constraint will emerge. This is expected. The
pipeline improves through continuous iteration, not through a single redesign.
Key Design Principles
Fast feedback first
Organize pipeline stages so that the fastest checks run first. A developer should know
within minutes if their change has an obvious problem (compilation failure, linting error,
unit test failure). Slower checks (integration tests, security scans, performance tests)
run after the fast checks pass.
Fail fast, fail clearly
When the pipeline fails, it should fail as early as possible and produce a clear, actionable
error message. A developer should be able to read the failure output and know exactly what
to fix without digging through logs.
Parallelize where possible
Stages that do not depend on each other should run in parallel. Security scans can run
alongside integration tests. Linting can run alongside compilation. Parallelization is the
most effective way to reduce pipeline duration without removing checks.
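As a sketch, independent checks can be fanned out from a small driver script; the commands shown are placeholders, and most CI systems express the same fan-out natively in their pipeline definitions.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Independent checks run side by side; the slowest one sets the stage duration.
# The commands are placeholders - substitute your own tooling.
PARALLEL_CHECKS = {
    "lint": ["npm", "run", "lint"],
    "security scan": ["npm", "audit"],
    "integration tests": ["npm", "run", "test:integration"],
}

with ThreadPoolExecutor(max_workers=len(PARALLEL_CHECKS)) as pool:
    futures = {name: pool.submit(subprocess.run, cmd) for name, cmd in PARALLEL_CHECKS.items()}
    results = {name: future.result().returncode == 0 for name, future in futures.items()}

for name, passed in results.items():
    print(f"{name}: {'passed' if passed else 'failed'}")
```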
Pipeline as code
The pipeline definition lives in the same repository as the application it builds and
deploys. This gives the team full ownership and allows the pipeline to evolve alongside
the application.
Observability
Instrument the pipeline itself with metrics and monitoring. Track:
- Lead time - time from commit to production deployment
- Pipeline duration - time from pipeline start to completion
- Failure rate - percentage of pipeline runs that fail
- Recovery time - time from failure detection to successful re-run
- Queue time - time spent waiting before the pipeline starts
These metrics identify bottlenecks and measure improvement over time.
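A sketch of computing a few of these from pipeline run records; the PipelineRun shape is a stand-in for whatever your CI system exports, and lead time additionally needs commit and deployment timestamps, which are omitted here.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class PipelineRun:
    queued_at: datetime
    started_at: datetime
    finished_at: datetime
    succeeded: bool

def summarize(runs: list[PipelineRun]) -> dict[str, float]:
    """Aggregate queue time, duration, and failure rate across pipeline runs."""
    return {
        "queue_time_min": mean((r.started_at - r.queued_at).total_seconds() / 60 for r in runs),
        "pipeline_duration_min": mean((r.finished_at - r.started_at).total_seconds() / 60 for r in runs),
        "failure_rate_pct": 100 * sum(not r.succeeded for r in runs) / len(runs),
    }
```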
Anti-Patterns
The “grand redesign”
Attempting to redesign the entire pipeline at once, rather than iteratively improving the
biggest constraint, is a common failure mode. Grand redesigns take too long, introduce too
much risk, and often fail to address the actual problems.
Central pipeline teams that own all pipelines
A central team that controls all pipeline definitions creates a bottleneck. Application
teams wait for changes, cannot customize pipelines for their context, and are disconnected
from their own delivery process.
Optimizing non-constraints
Speeding up a pipeline stage that is not the bottleneck does not improve overall delivery
time. Measure before optimizing.
Monolithic pipeline for microservices
Running all microservices through a single pipeline that builds and deploys everything
together defeats the purpose of a microservice architecture. Each service should have its
own independent pipeline.
How to Get Started
Step 1: Assess your current state
Determine which architecture state - entangled, tightly coupled, or loosely coupled -
best describes your current pipeline. Be honest about where you are.
Step 2: Measure your pipeline
Instrument your pipeline to measure duration, failure rates, queue times, and
bottlenecks. You cannot improve what you do not measure.
Step 3: Identify the top constraint
Using your measurements, identify the single biggest bottleneck in your pipeline. This is
where you focus first.
Step 4: Apply the Theory of Constraints cycle
Exploit, subordinate, and if necessary elevate the constraint. Then measure again and
identify the next constraint.
Step 5: Evolve toward loose coupling
With each improvement cycle, move toward independent, team-owned pipelines that can
build, test, and deploy services independently. This is a journey of months or years,
not days.
Connection to the Pipeline Phase
Pipeline architecture is where all the other practices in this phase come together. The
single path to production defines the route. The
deterministic pipeline ensures reliability. The
deployable definition defines the quality gates. The
architecture determines how these elements are organized, sequenced, and optimized for
flow.
As teams mature their pipeline architecture toward loose coupling, they build the
foundation for Phase 3: Optimize - where the focus shifts from building
the pipeline to improving its speed and reliability.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
3.3.8 - Rollback
Enable fast recovery from any deployment by maintaining the ability to roll back.
Phase 2 - Pipeline | Adapted from MinimumCD.org
Definition
Rollback is the ability to quickly and safely revert a production deployment to a previous
known-good state. It is the safety net that makes continuous delivery possible: because you
can always undo a deployment, deploying becomes a low-risk, routine operation.
Rollback is not a backup plan for when things go catastrophically wrong. It is a standard
operational capability that should be exercised regularly and trusted completely. Every
deployment to production should be accompanied by a tested, automated, fast rollback
mechanism.
Why It Matters for CD Migration
Fear of deployment is the single biggest cultural barrier to continuous delivery. Teams
that have experienced painful, irreversible deployments develop a natural aversion to
deploying frequently. They batch changes, delay releases, and add manual approval gates -
all of which slow delivery and increase risk.
Reliable, fast rollback breaks this cycle. When the team knows that any deployment can be
reversed in minutes, the perceived risk of deployment drops dramatically. Smaller, more
frequent deployments become possible. The feedback loop tightens. The entire delivery
system improves.
Key Principles
Fast
Rollback must complete in minutes, not hours. A rollback that takes an hour to execute
is not a rollback - it is a prolonged outage with a recovery plan. Target rollback times
of 5 minutes or less for the deployment mechanism itself. If the previous artifact is
already in the artifact repository and the deployment mechanism is automated, there is
no reason rollback should take longer than a fresh deployment.
Automated
Rollback must be a single command or a single click - or better, fully automated based
on health checks. It should not require:
- SSH access to production servers
- Manual editing of configuration files
- Running scripts with environment-specific parameters from memory
- Coordinating multiple teams to roll back multiple services simultaneously
Safe
Rollback must not make things worse. This means:
- Rolling back must not lose data
- Rolling back must not corrupt state
- Rolling back must not break other services that depend on the rolled-back service
- Rolling back must not require downtime beyond what the deployment mechanism itself imposes
Simple
The rollback procedure should be understandable by any team member, including those who
did not perform the original deployment. It should not require specialized knowledge, deep
system understanding, or heroic troubleshooting.
Tested
Rollback must be tested regularly, not just documented. A rollback procedure that has
never been exercised is a rollback procedure that will fail when you need it most. Include
rollback verification in your deployable definition and
practice rollback as part of routine deployment validation.
Rollback Strategies
Blue-Green Deployment
Maintain two identical production environments - blue and green. At any time, one is live
(serving traffic) and the other is idle. To deploy, deploy to the idle environment, verify
it, and switch traffic. To roll back, switch traffic back to the previous environment.
Advantages:
- Rollback is instantaneous - just a traffic switch
- The previous version remains running and warm
- Zero-downtime deployment and rollback
Considerations:
- Requires double the infrastructure (though the idle environment can be scaled down)
- Database changes must be backward-compatible across both versions
- Session state must be externalized so it survives the switch
Canary Deployment
Deploy the new version to a small subset of production infrastructure (the “canary”) and
route a percentage of traffic to it. Monitor the canary for errors, latency, and business
metrics. If the canary is healthy, gradually increase traffic. If problems appear, route
all traffic back to the previous version.
Advantages:
- Limits blast radius - problems affect only a fraction of users
- Provides real production data for validation before full rollout
- Rollback is fast - stop sending traffic to the canary
Considerations:
- Requires traffic routing infrastructure (service mesh, load balancer configuration)
- Both versions must be able to run simultaneously
- Monitoring must be sophisticated enough to detect subtle problems in the canary
Feature Flag Rollback
When a deployment introduces new behavior behind a feature flag, rollback can be as
simple as turning off the flag. The code remains deployed, but the new behavior is
disabled. This is the fastest possible rollback - it requires no deployment at all.
Advantages:
- Instantaneous - no deployment, no traffic switch
- Granular - roll back a single feature without affecting other changes
- No infrastructure changes required
Considerations:
- Requires a feature flag system with runtime toggle capability
- Only works for changes that are behind flags
- Feature flag debt (old flags that are never cleaned up) must be managed
Database-Safe Rollback with Expand-Contract
Database schema changes are the most common obstacle to rollback. If a deployment changes
the database schema, rolling back the application code may fail if the old code is
incompatible with the new schema.
The expand-contract pattern (also called parallel change) solves this:
- Expand - add new columns, tables, or structures alongside the existing ones. The
old application code continues to work. Deploy this change.
- Migrate - update the application to write to both old and new structures, and read
from the new structure. Deploy this change. Backfill historical data.
- Contract - once all application versions using the old structure are retired, remove
the old columns or tables. Deploy this change.
At every step, the previous application version remains compatible with the current
database schema. Rollback is always safe.
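As an illustration, here is what the three separately deployed steps might look like for a hypothetical rename of users.username to users.login; the table and column names are invented for the sketch.

```python
# Each constant is a separately deployed step for a hypothetical rename of
# users.username to users.login. The schema details are invented for the sketch.

EXPAND = "ALTER TABLE users ADD COLUMN login TEXT;"
# Old application code keeps reading and writing username; nothing breaks.

MIGRATE = "UPDATE users SET login = username WHERE login IS NULL;"
# Deployed alongside application code that writes both columns and reads login.

CONTRACT = "ALTER TABLE users DROP COLUMN username;"
# Run only after no deployed application version still reads username.
```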
Anti-pattern: Destructive schema changes (dropping columns, renaming tables,
changing types) deployed simultaneously with the application code change that requires
them. This makes rollback impossible because the old code cannot work with the new schema.
Anti-Patterns
“We’ll fix forward”
Relying exclusively on fixing forward (deploying a new fix rather than rolling back) is
dangerous when the system is actively degraded. Fix-forward should be an option when
the issue is well-understood and the fix is quick. Rollback should be the default when
the issue is unclear or the fix will take time. Both capabilities must exist.
Rollback as a documented procedure only
A rollback procedure that exists only in a runbook, wiki, or someone’s memory is not a
reliable rollback capability. Procedures that are not automated and regularly tested will
fail under the pressure of a production incident.
Coupled service rollbacks
When rolling back service A requires simultaneously rolling back services B and C, you
do not have independent rollback capability. Design services to be backward-compatible
so that each service can be rolled back independently.
Destructive database migrations
Schema changes that destroy data or break backward compatibility make rollback impossible.
Always use the expand-contract pattern for schema changes.
Manual rollback requiring specialized knowledge
If only one person on the team knows how to perform a rollback, the team does not have a
rollback capability - it has a single point of failure. Rollback must be simple enough
for any team member to execute.
Good Patterns
Automated rollback on health check failure
Configure the deployment system to automatically roll back if the new version fails
health checks within a defined window after deployment. This removes the need for a human
to detect the problem and initiate the rollback.
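A minimal sketch of such a watcher, assuming a hypothetical health endpoint and a hypothetical single-command rollback script; deployment platforms and progressive delivery tools offer this as a built-in capability.

```python
import subprocess
import time
import urllib.request

HEALTH_URL = "https://my-service.internal/healthz"        # hypothetical endpoint
ROLLBACK_COMMAND = ["./rollback.sh", "--to", "previous"]   # hypothetical one-command rollback
VERIFICATION_WINDOW_SECONDS = 300

def healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as response:
            return response.status == 200
    except OSError:
        return False

deadline = time.monotonic() + VERIFICATION_WINDOW_SECONDS
while time.monotonic() < deadline:
    if not healthy():
        print("Health check failed - rolling back to the previous artifact.")
        subprocess.run(ROLLBACK_COMMAND, check=True)
        break
    time.sleep(15)
else:
    print("New version stayed healthy through the verification window.")
```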
Rollback testing in staging
As part of every deployment to staging, deploy the new version, verify it, then roll it
back and verify the rollback. This ensures that rollback works for every release, not
just in theory.
Artifact retention
Retain previous artifact versions in the artifact repository so that rollback is always
possible. Define a retention policy (for example, keep the last 10 production-deployed
versions) and ensure that rollback targets are always available.
Deployment log and audit trail
Maintain a clear record of what is currently deployed, what was previously deployed, and
when changes occurred. This makes it easy to identify the correct rollback target and
verify that the rollback was successful.
Rollback runbook exercises
Regularly practice rollback as a team exercise - not just as part of automated testing,
but as a deliberate drill. This builds team confidence and identifies gaps in the process.
How to Get Started
Step 1: Document your current rollback capability
Can you roll back your current production deployment right now? How long would it take?
Who would need to be involved? What could go wrong? Be honest about the answers.
Step 2: Implement a basic automated rollback
Start with the simplest mechanism available for your deployment platform - redeploying the
previous container image, switching a load balancer target, or reverting a Kubernetes
deployment. Automate this as a single command.
Step 3: Test the rollback
Deploy a change to staging, then roll it back. Verify that the system returns to its
previous state. Make this a standard part of your deployment validation.
Step 4: Address database compatibility
Audit your database migration practices. If you are making destructive schema changes,
shift to the expand-contract pattern. Ensure that the previous application version is
always compatible with the current database schema.
Step 5: Reduce rollback time
Measure how long rollback takes. Identify and eliminate delays - slow artifact downloads,
slow startup times, manual steps. Target rollback completion in under 5 minutes.
Step 6: Build team confidence
Practice rollback regularly. Demonstrate it during deployment reviews. Make it a normal
part of operations, not an emergency procedure. When the team trusts rollback, they will
trust deployment.
Connection to the Pipeline Phase
Rollback is the capstone of the Pipeline phase. It is what makes the rest of the phase
safe.
With rollback in place, the team has the confidence to deploy frequently, which is the
foundation for Phase 3: Optimize and ultimately
Phase 4: Deliver on Demand.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
3.4 - Phase 3: Optimize
Improve flow by reducing batch size, limiting work in progress, and using metrics to drive improvement.
Key question: “Can we deliver small changes quickly?”
With a working pipeline in place, this phase focuses on optimizing the flow of changes
through it. Smaller batches, feature flags, and WIP limits reduce risk and increase
delivery frequency.
What You’ll Do
- Reduce batch size - Deliver smaller, more frequent changes
- Use feature flags - Decouple deployment from release
- Limit work in progress - Focus on finishing over starting
- Drive improvement with metrics - Use DORA metrics and improvement kata
- Run effective retrospectives - Continuously improve the delivery process
- Decouple architecture - Enable independent deployment of components
Why This Phase Matters
Having a pipeline isn’t enough - you need to optimize the flow through it. Teams that
deploy weekly with a CD pipeline are missing most of the benefits. Small batches reduce
risk, feature flags enable testing in production, and metrics-driven improvement creates
a virtuous cycle of getting better at getting better.
When You’re Ready to Move On
You’re ready for Phase 4: Deliver on Demand when:
- Most changes are small enough to deploy independently
- Feature flags let you deploy incomplete features safely
- Your WIP limits keep work flowing without bottlenecks
- You’re measuring and improving your DORA metrics regularly
3.4.1 - Small Batches
Deliver smaller, more frequent changes to reduce risk and increase feedback speed.
Phase 3 - Optimize | Adapted from MinimumCD.org
Batch size is the single biggest lever for improving delivery performance. This page covers what batch size means at every level - deploy frequency, commit size, and story size - and provides concrete techniques for reducing it.
Why Batch Size Matters
Large batches create large risks. When you deploy 50 changes at once, any failure could be caused by any of those 50 changes. When you deploy 1 change, the cause of any failure is obvious.
This is not a theory. The DORA research consistently shows that elite teams deploy more frequently, with smaller changes, and have both higher throughput and lower failure rates. Small batches are the mechanism that makes this possible.
“If it hurts, do it more often, and bring the pain forward.”
- Jez Humble, Continuous Delivery
Three Levels of Batch Size
Batch size is not just about deployments. It operates at three distinct levels, and optimizing only one while ignoring the others limits your improvement.
Level 1: Deploy Frequency
How often you push changes to production.
| State | Deploy Frequency | Risk Profile |
|---|---|---|
| Starting | Monthly or quarterly | Each deploy is a high-stakes event |
| Improving | Weekly | Deploys are planned but routine |
| Optimizing | Daily | Deploys are unremarkable |
| Elite | Multiple times per day | Deploys are invisible |
How to reduce: Remove manual gates, automate approval workflows, build confidence through progressive rollout. If your pipeline is reliable (Phase 2), the only thing preventing more frequent deploys is organizational habit.
Level 2: Commit Size
How much code changes in each commit to trunk.
| Indicator | Too Large | Right-Sized |
|---|---|---|
| Files changed | 20+ files | 1-5 files |
| Lines changed | 500+ lines | Under 100 lines |
| Review time | Hours or days | Minutes |
| Merge conflicts | Frequent | Rare |
| Description length | Paragraph needed | One sentence suffices |
How to reduce: Practice TDD (write one test, make it pass, commit). Use feature flags to merge incomplete work. Pair program so review happens in real time.
Level 3: Story Size
How much scope each user story or work item contains.
A story that takes a week to complete is a large batch. It means a week of work piles up before integration, a week of assumptions go untested, and a week of inventory sits in progress.
Target: Every story should be completable - coded, tested, reviewed, and integrated - in two days or less. If it cannot be, it needs to be decomposed further.
Behavior-Driven Development for Decomposition
BDD provides a concrete technique for breaking stories into small, testable increments. The Given-When-Then format forces clarity about scope.
The Given-When-Then Pattern
Each scenario becomes a deliverable increment. You can implement and deploy the first scenario before starting the second. This is how you turn a “discount feature” (large batch) into three independent, deployable changes (small batches).
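For example, the hypothetical "discount feature" might decompose into scenarios like the following pytest-style sketch; Cart and the discount rule are stand-ins, not a real API.

```python
# Pytest-style sketch: one scenario per test, each small enough to ship on its own.
# Cart and the discount rule are stand-ins, not a real API.

class Cart:
    def __init__(self, total: float) -> None:
        self.total = total

    def apply_discount(self, code: str) -> None:
        if code == "SAVE10":          # placeholder rule for the sketch
            self.total *= 0.9

def test_simple_percentage_discount_is_applied():
    # Given a cart totaling 100 and a valid 10% discount code
    cart = Cart(total=100)
    # When the shopper applies the code
    cart.apply_discount("SAVE10")
    # Then the total is reduced to 90
    assert cart.total == 90

# "Reject expired discount codes" and "apply discounts only to eligible items"
# become their own tests - and their own independently deployable slices.
```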
Decomposing Stories Using Scenarios
When a story has too many scenarios, it is too large. Use this process:
- Write all the scenarios first. Before any code, enumerate every Given-When-Then for the story.
- Group scenarios into deliverable slices. Each slice should be independently valuable or at least independently deployable.
- Create one story per slice. Each story has 1-3 scenarios and can be completed in 1-2 days.
- Order the slices by value. Deliver the most important behavior first.
Example decomposition:
| Original Story | Scenarios | Sliced Into |
|---|---|---|
| "As a user, I can manage my profile" | 12 scenarios covering name, email, password, avatar, notifications, privacy, deactivation | 5 stories: basic info (2 scenarios), password (2), avatar (2), notifications (3), deactivation (3) |
Vertical Slicing
A vertical slice cuts through all layers of the system to deliver a thin piece of end-to-end functionality. This is the opposite of horizontal slicing, where you build all the database changes, then all the API changes, then all the UI changes.
Horizontal vs. Vertical Slicing
Horizontal (avoid):
- Story 1: Build the database schema for discounts
- Story 2: Build the API endpoints for discounts
- Story 3: Build the UI for applying discounts
Problems: Stories 1 and 2 deliver no user value. You cannot test end-to-end until story 3 is done. Integration risk accumulates.
Vertical (prefer):
- Story 1: Apply a simple percentage discount (DB + API + UI for one scenario)
- Story 2: Reject expired discount codes (DB + API + UI for one scenario)
- Story 3: Apply discounts only to eligible items (DB + API + UI for one scenario)
Benefits: Every story delivers testable, deployable functionality. Integration happens with each story, not at the end. You can ship story 1 and get feedback before building story 2.
How to Slice Vertically
Ask these questions about each proposed story:
- Can a user (or another system) observe the change? If not, slice differently.
- Can I write an end-to-end test for it? If not, the slice is incomplete.
- Does it require all other slices to be useful? If yes, find a thinner first slice.
- Can it be deployed independently? If not, check whether feature flags could help.
Practical Steps for Reducing Batch Size
Week 1-2: Measure Current State
Before changing anything, measure where you are:
- Average commit size (lines changed per commit)
- Average story cycle time (time from start to done)
- Deploy frequency (how often changes reach production)
- Average changes per deploy (how many commits per deployment)
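For the first of these measures, a rough baseline can be pulled straight from git history. This sketch ignores merge-commit diffs and binary files and is only meant as a starting point.

```python
import subprocess

def average_commit_size(last_n: int = 200) -> float:
    """Average lines changed (added + deleted) per commit on the current branch."""
    log = subprocess.run(
        ["git", "log", f"-{last_n}", "--numstat", "--format=COMMIT"],
        capture_output=True, text=True, check=True,
    ).stdout
    commits, lines = 0, 0
    for row in log.splitlines():
        if row == "COMMIT":
            commits += 1
        elif row.strip():
            added, deleted, _path = row.split("\t", 2)
            if added.isdigit() and deleted.isdigit():   # numstat shows '-' for binary files
                lines += int(added) + int(deleted)
    return lines / commits if commits else 0.0

print(f"Average commit size: {average_commit_size():.0f} lines changed")
```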
Week 3-4: Introduce Story Decomposition
- Start writing BDD scenarios before implementation
- Split any story estimated at more than 2 days
- Track the number of stories completed per week (expect this to increase as stories get smaller)
Week 5-8: Tighten Commit Size
- Adopt the discipline of “one logical change per commit”
- Use TDD to create a natural commit rhythm: write test, make it pass, commit
- Track average commit size and set a team target (e.g., under 100 lines)
Ongoing: Increase Deploy Frequency
- Deploy at least once per day, then work toward multiple times per day
- Remove any batch-oriented processes (e.g., “we deploy on Tuesdays”)
- Make deployment a non-event
Key Pitfalls
1. “Small stories take more overhead to manage”
This is true only if your process adds overhead per story (e.g., heavyweight estimation ceremonies, multi-level approval). The solution is to simplify the process, not to keep stories large. Overhead per story should be near zero for a well-decomposed story.
2. “Some things can’t be done in small batches”
Almost anything can be decomposed further. Database migrations can be done in backward-compatible steps. API changes can use versioning. UI changes can be hidden behind feature flags. The skill is in finding the decomposition, not in deciding whether one exists.
3. “We tried small stories but our throughput dropped”
This usually means the team is still working sequentially. Small stories require limiting WIP and swarming - see Limiting WIP. If the team starts 10 small stories instead of 2 large ones, they have not actually reduced batch size; they have increased WIP.
Measuring Success
You are succeeding when the measures from your initial baseline move in the right direction: average commit size and changes per deploy trend down, while deploy frequency and stories completed per week trend up.
Next Step
Small batches often require deploying incomplete features to production. Feature Flags provide the mechanism to do this safely.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
3.4.2 - Feature Flags
Decouple deployment from release by using feature flags to control feature visibility.
Phase 3 - Optimize | Adapted from MinimumCD.org
Feature flags are the mechanism that makes trunk-based development and small batches safe. They let you deploy code to production without exposing it to users, enabling dark launches, gradual rollouts, and instant rollback of features without redeploying.
Why Feature Flags?
In continuous delivery, deployment and release are two separate events:
- Deployment is pushing code to production.
- Release is making a feature available to users.
Feature flags are the bridge between these two events. They let you deploy frequently (even multiple times a day) without worrying about exposing incomplete or untested features. This separation is what makes continuous deployment possible for teams that ship real products to real users.
When You Need Feature Flags (and When You Don’t)
Not every change requires a feature flag. Flags add complexity, and unnecessary complexity slows you down. Use this decision tree to determine the right approach.
Decision Tree
Is the change user-visible?
├── No → Deploy without a flag
│ (refactoring, performance improvements, dependency updates)
│
└── Yes → Can it be completed and deployed in a single small batch?
├── Yes → Deploy without a flag
│ (bug fixes, copy changes, small UI tweaks)
│
└── No → Is there a seam in the code where you can introduce the change?
├── Yes → Consider Branch by Abstraction
│ (replacing a subsystem, swapping an implementation)
│
└── No → Is it a new feature with a clear entry point?
├── Yes → Use a Feature Flag
│
└── No → Consider Connect Tests Last
(build the internals first, wire them up last)
Alternatives to Feature Flags
| Technique | How It Works | When to Use |
| --- | --- | --- |
| Branch by Abstraction | Introduce an abstraction layer, build the new implementation behind it, switch when ready | Replacing an existing subsystem or library |
| Connect Tests Last | Build internal components without connecting them to the UI or API | New backend functionality that has no user-facing impact until connected |
| Dark Launch | Deploy the code path but do not route any traffic to it | New infrastructure, new services, or new endpoints that are not yet referenced |
These alternatives avoid the lifecycle overhead of feature flags while still enabling trunk-based development with incomplete work.
Implementation Approaches
Feature flags can be implemented at different levels of sophistication. Start simple and add complexity only when needed.
Level 1: Static Code-Based Flags
The simplest approach: a boolean constant or configuration value checked in code.
Pros: Zero infrastructure. Easy to understand. Works everywhere.
Cons: Changing a flag requires a deployment. No per-user targeting. No gradual rollout.
Best for: Teams starting out. Internal tools. Changes that will be fully on or fully off.
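A minimal sketch of a Level 1 flag; the flag name and checkout functions are illustrative, not names from the original text:

```python
# Level 1: a static, code-based flag. Changing it requires a redeploy.
NEW_CHECKOUT_ENABLED = False  # flip to True in a later commit when the feature is ready


def render_legacy_checkout(cart: list[str]) -> str:
    return f"legacy checkout: {len(cart)} items"


def render_new_checkout(cart: list[str]) -> str:
    return f"new checkout: {len(cart)} items"


def render_checkout(cart: list[str]) -> str:
    # The flag check keeps the new path dark until the constant is flipped.
    if NEW_CHECKOUT_ENABLED:
        return render_new_checkout(cart)
    return render_legacy_checkout(cart)


print(render_checkout(["book", "pen"]))  # -> "legacy checkout: 2 items"
```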
Level 2: Dynamic In-Process Flags
Flags stored in a configuration file, database, or environment variable that can be changed at runtime without redeploying.
Pros: No redeployment needed. Supports percentage rollout. Simple to implement.
Cons: Each instance reads its own config - no centralized view. Limited targeting capabilities.
Best for: Teams that need gradual rollout but do not want to adopt a third-party service yet.
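One way a Level 2 flag might look, reading its rollout percentage from an environment variable and bucketing users with a stable hash so each user gets a consistent experience (the variable name and helpers are assumptions):

```python
# Level 2: a dynamic in-process flag with percentage rollout.
# The flag state comes from the environment (or a config file / database row),
# so it can change at runtime without a redeploy.
import hashlib
import os


def rollout_percentage(flag_name: str, default: int = 0) -> int:
    # e.g. export FLAG_NEW_CHECKOUT_PERCENT=25
    return int(os.environ.get(f"FLAG_{flag_name}_PERCENT", default))


def is_enabled(flag_name: str, user_id: str) -> bool:
    """Stable per-user decision: the same user always gets the same answer
    for a given percentage, so the experience does not flicker between requests."""
    bucket = int(hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < rollout_percentage(flag_name)


if __name__ == "__main__":
    os.environ["FLAG_NEW_CHECKOUT_PERCENT"] = "25"
    enabled = sum(is_enabled("NEW_CHECKOUT", f"user-{i}") for i in range(1000))
    print(f"~{enabled / 10:.0f}% of users see the new checkout")
```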
Level 3: Centralized Flag Service
A dedicated service (self-hosted or SaaS) that manages all flags, provides a dashboard, supports targeting rules, and tracks flag usage.
Examples: LaunchDarkly, Unleash, Flagsmith, Split, or a custom internal service.
Pros: Centralized management. Rich targeting (by user, plan, region, etc.). Audit trail. Real-time changes.
Cons: Added dependency. Cost (for SaaS). Network latency for flag evaluation (mitigated by local caching in most SDKs).
Best for: Teams at scale. Products with diverse user segments. Regulated environments needing audit trails.
Level 4: Infrastructure Routing
Instead of checking flags in application code, route traffic at the infrastructure level (load balancer, service mesh, API gateway).
Pros: No application code changes. Clean separation of routing from logic. Works across services.
Cons: Requires infrastructure investment. Less granular than application-level flags. Harder to target individual users.
Best for: Microservice architectures. Service-level rollouts. A/B testing at the infrastructure layer.
Feature Flag Lifecycle
Every feature flag has a lifecycle. Flags that are not actively managed become technical debt. Follow this lifecycle rigorously.
The Six Stages
1. CREATE → Define the flag, document its purpose and owner
2. DEPLOY OFF → Code ships to production with the flag disabled
3. BUILD → Incrementally add functionality behind the flag
4. DARK LAUNCH → Enable for internal users or a small test group
5. ROLLOUT → Gradually increase the percentage of users
6. REMOVE → Delete the flag and the old code path
Stage 1: Create
Before writing any code, define the flag (a sketch of a flag record follows this list):
- Name: Use a consistent naming convention (e.g., `enable-new-checkout`, `feature.discount-engine`)
- Owner: Who is responsible for this flag through its lifecycle?
- Purpose: One sentence describing what the flag controls
- Planned removal date: Set this at creation time. Flags without removal dates become permanent.
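A flag record could live next to the code as a small registry entry. The structure below is an illustrative sketch that mirrors the fields above, not a required format:

```python
# flag_registry.py - one entry per flag, reviewed in the monthly flag audit.
from dataclasses import dataclass
from datetime import date


@dataclass
class FlagDefinition:
    name: str              # consistent naming convention
    owner: str             # accountable through the whole lifecycle
    purpose: str           # one sentence
    planned_removal: date  # set at creation time


FLAGS = [
    FlagDefinition(
        name="enable-new-checkout",
        owner="checkout-team",
        purpose="Controls the rewritten checkout flow while it is built incrementally.",
        planned_removal=date(2025, 9, 30),  # example value
    ),
]
```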
Stage 2: Deploy OFF
The first deployment includes the flag check but the flag is disabled. This verifies that:
- The flag infrastructure works
- The default (off) path is unaffected
- The flag check does not introduce performance issues
Stage 3: Build Incrementally
Continue building the feature behind the flag over multiple deploys. Each deploy adds more functionality, but the flag remains off for users. Test both paths in your automated suite:
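A hedged sketch of what that might look like with pytest, parametrizing one behavior over both flag states (the function under test is a stand-in, not an API defined by this guide):

```python
# A pytest sketch that runs the same scenario with the flag off and on.
# "render_checkout" and the flag override are assumed stand-ins; adapt them
# to whatever flag helper your codebase actually uses.
import pytest


def render_checkout(cart: list[str], new_checkout_enabled: bool) -> str:
    prefix = "new" if new_checkout_enabled else "legacy"
    return f"{prefix} checkout: {len(cart)} items"


@pytest.mark.parametrize("flag_on", [False, True])
def test_checkout_item_count_shown_in_both_paths(flag_on):
    page = render_checkout(["book", "pen"], new_checkout_enabled=flag_on)
    # The behavior under test must hold whether or not the flag is enabled.
    assert "2 items" in page
```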
Stage 4: Dark Launch
Enable the flag for internal users or a specific test group. This is your first validation with real production data and real traffic patterns. Monitor:
- Error rates for the flagged group vs. control
- Performance metrics (latency, throughput)
- Business metrics (conversion, engagement)
Stage 5: Gradual Rollout
Increase exposure systematically:
| Step | Audience | Duration | What to Watch |
| --- | --- | --- | --- |
| 1 | 1% of users | 1-2 hours | Error rates, latency |
| 2 | 5% of users | 4-8 hours | Performance at slightly higher load |
| 3 | 25% of users | 1 day | Business metrics begin to be meaningful |
| 4 | 50% of users | 1-2 days | Statistically significant business impact |
| 5 | 100% of users | - | Full rollout |
At any step, if metrics degrade, roll back by disabling the flag. No redeployment needed.
Stage 6: Remove
This is the most commonly skipped step, and skipping it creates significant technical debt.
Once the feature has been stable at 100% for an agreed period (e.g., 2 weeks):
- Remove the flag check from code
- Remove the old code path
- Remove the flag definition from the flag service
- Deploy the simplified code
Set a maximum flag lifetime. A common practice is 90 days. Any flag older than 90 days triggers an automatic review. Stale flags are a maintenance burden and a source of confusion.
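The age check can be automated. A sketch of a CI job that could fail the build when any flag exceeds the agreed lifetime (the registry shape and the 90-day threshold are assumptions based on the text above):

```python
# Sketch: fail the build (or open a ticket) when a flag outlives its welcome.
from datetime import date, timedelta

MAX_FLAG_AGE_DAYS = 90  # team policy; adjust to your agreed lifetime

# In practice this would come from your flag service or registry module.
FLAG_CREATION_DATES = {
    "enable-new-checkout": date(2025, 1, 15),
    "feature.discount-engine": date(2024, 10, 1),
}


def stale_flags(today: date | None = None) -> list[str]:
    today = today or date.today()
    cutoff = today - timedelta(days=MAX_FLAG_AGE_DAYS)
    return [name for name, created in FLAG_CREATION_DATES.items() if created < cutoff]


if __name__ == "__main__":
    overdue = stale_flags()
    if overdue:
        raise SystemExit(f"Flags overdue for review/removal: {', '.join(overdue)}")
    print("No stale flags.")
```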
Key Pitfalls
1. “We have 200 feature flags and nobody knows what they all do”
This is flag debt, and it is as damaging as any other technical debt. Prevent it by enforcing the lifecycle: every flag has an owner, a purpose, and a removal date. Run a monthly flag audit.
2. “We use flags for everything, including configuration”
Feature flags and configuration are different concerns. Flags are temporary (they control unreleased features). Configuration is permanent (it controls operational behavior like timeouts, connection pools, log levels). Mixing them leads to confusion about what can be safely removed.
3. “Testing both paths doubles our test burden”
It does increase test effort, but this is a temporary cost. When the flag is removed, the extra tests go away too. The alternative - deploying untested code paths - is far more expensive.
4. “Nested flags create combinatorial complexity”
Avoid nesting flags whenever possible. If feature B depends on feature A, do not create a separate flag for B. Instead, extend the behavior behind feature A’s flag. If you must nest, document the dependency and test the specific combinations that matter.
Measuring Success
| Metric | Target | Why It Matters |
| --- | --- | --- |
| Active flag count | Stable or decreasing | Confirms flags are being removed, not accumulating |
| Average flag age | < 90 days | Catches stale flags before they become permanent |
| Flag-related incidents | Near zero | Confirms flag management is not causing problems |
| Time from deploy to release | Hours to days (not weeks) | Confirms flags enable fast, controlled releases |
Next Step
Small batches and feature flags let you deploy more frequently, but deploying more means more work in progress. Limiting WIP ensures that increased deploy frequency does not create chaos.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
3.4.3 - Limiting Work in Progress
Focus on finishing work over starting new work to improve flow and reduce cycle time.
Phase 3 - Optimize | Adapted from Dojo Consortium
Work in progress (WIP) is inventory. Like physical inventory, it loses value the longer it sits unfinished. Limiting WIP is the most counterintuitive and most impactful practice in this entire migration: doing less work at once makes you deliver more.
Why Limiting WIP Matters
Every item of work in progress has a cost:
- Context switching: Moving between tasks destroys focus. Research consistently shows that switching between two tasks reduces productive time by 20-40%.
- Delayed feedback: Work that is started but not finished cannot be validated by users. The longer it sits, the more assumptions go untested.
- Hidden dependencies: The more items in progress simultaneously, the more likely they are to conflict, block each other, or require coordination.
- Longer cycle time: Little's Law states that cycle time = WIP / throughput. If throughput is constant, the only way to reduce cycle time is to reduce WIP. For example, a team with 8 items in progress and a throughput of 2 items per day has an average cycle time of 4 days; cutting WIP to 4 items halves it to 2 days.
“Stop starting, start finishing.”
How to Set Your WIP Limit
The N+2 Starting Point
A practical starting WIP limit for a team is N+2, where N is the number of team members actively working on delivery.
| Team Size | Starting WIP Limit | Rationale |
| --- | --- | --- |
| 3 developers | 5 items | Allows one item per person plus a small buffer |
| 5 developers | 7 items | Same principle at larger scale |
| 8 developers | 10 items | Buffer becomes proportionally smaller |
Why N+2 and not N? Because some items will be blocked waiting for review, testing, or external dependencies. A small buffer prevents team members from being idle when their primary task is blocked. But the buffer should be small - two items, not ten.
Continuously Lower the Limit
The N+2 formula is a starting point, not a destination. Once the team is comfortable with the initial limit, reduce it:
- Start at N+2. Run for 2-4 weeks. Observe where work gets stuck.
- Reduce to N+1. Tighten the limit. Some team members will occasionally be “idle” - this is a feature, not a bug. They should swarm on blocked items.
- Reduce to N. At this point, every team member is working on exactly one thing. Blocked work gets immediate attention because someone is always available to help.
- Consider going below N. Some teams find that pairing (two people, one item) further reduces cycle time. A team of 6 with a WIP limit of 3 means everyone is pairing.
Each reduction will feel uncomfortable. That discomfort is the point - it exposes problems in your workflow that were previously hidden by excess WIP.
What Happens When You Hit the Limit
When the team reaches its WIP limit and someone finishes a task, they have two options:
- Pull the next highest-priority item (if the WIP limit allows it).
- Swarm on an existing item that is blocked, stuck, or nearing its cycle time target.
When the WIP limit is reached and no items are complete:
- Do not start new work. This is the hardest part and the most important.
- Help unblock existing work. Pair with someone. Review a pull request. Write a missing test. Talk to the person who has the answer to the blocking question.
- Improve the process. If nothing is blocked but everything is slow, this is the time to work on automation, tooling, or documentation.
Swarming
Swarming is the practice of multiple team members working together on a single item to get it finished faster. It is the natural complement to WIP limits.
When to Swarm
- An item has been in progress for longer than the team’s cycle time target (e.g., more than 2 days)
- An item is blocked and the blocker can be resolved by another team member
- The WIP limit is reached and someone needs work to do
- A critical defect needs to be fixed immediately
How to Swarm Effectively
| Approach | How It Works | Best For |
| --- | --- | --- |
| Pair programming | Two developers work on the same item at the same machine | Complex logic, knowledge transfer, code that needs review |
| Mob programming | The whole team works on one item together | Critical path items, complex architectural decisions |
| Divide and conquer | Break the item into sub-tasks and assign them | Items that can be parallelized (e.g., frontend + backend + tests) |
| Unblock and return | One person resolves the blocker, then hands back | External dependencies, environment issues, access requests |
Why Teams Resist Swarming
The most common objection: “It’s inefficient to have two people on one task.” This is only true if you measure efficiency as “percentage of time each person is writing new code.” If you measure efficiency as “how quickly value reaches production,” swarming is almost always faster because it reduces handoffs, wait time, and rework.
How Limiting WIP Exposes Workflow Issues
One of the most valuable effects of WIP limits is that they make hidden problems visible. When you cannot start new work, you are forced to confront the problems that slow existing work down.
| Symptom When WIP Is Limited | Root Cause Exposed |
| --- | --- |
| “I’m idle because my PR is waiting for review” | Code review process is too slow |
| “I’m idle because I’m waiting for the test environment” | Not enough environments, or environments are not self-service |
| “I’m idle because I’m waiting for the product owner to clarify requirements” | Stories are not refined before being pulled into the sprint |
| “I’m idle because my build is broken and I can’t figure out why” | Build is not deterministic, or test suite is flaky |
| “I’m idle because another team hasn’t finished the API I depend on” | Architecture is too tightly coupled (see Architecture Decoupling) |
Each of these is a bottleneck that was previously invisible because the team could always start something else. With WIP limits, these bottlenecks become obvious and demand attention.
Implementing WIP Limits
Step 1: Make WIP Visible (Week 1)
Before setting limits, make current WIP visible:
- Count the number of items currently “in progress” for the team
- Write this number on the board (physical or digital) every day
- Most teams are shocked by how high it is. A team of 5 often has 15-20 items in progress.
Step 2: Set the Initial Limit (Week 2)
- Calculate N+2 for your team
- Add the limit to your board (e.g., a column header that says “In Progress (limit: 7)”)
- Agree as a team that when the limit is reached, no new work starts
Step 3: Enforce the Limit (Week 3+)
- When someone tries to pull new work and the limit is reached, the team helps them find an existing item to work on
- Track violations: how often does the team exceed the limit? What causes it?
- Discuss in retrospectives: Is the limit too high? Too low? What bottlenecks are exposed?
Step 4: Reduce the Limit (Monthly)
- Every month, consider reducing the limit by 1
- Each reduction will expose new bottlenecks - this is the intended effect
- Stop reducing when the team reaches a sustainable flow where items move from start to done predictably
Key Pitfalls
1. “We set a WIP limit but nobody enforces it”
A WIP limit that is not enforced is not a WIP limit. Enforcement requires a team agreement and a visible mechanism. If the board shows 10 items in progress and the limit is 7, the team should stop and address it immediately. This is a working agreement, not a suggestion.
2. “Developers are idle and management is uncomfortable”
This is the most common failure mode. Management sees “idle” developers and concludes WIP limits are wasteful. In reality, those “idle” developers are either swarming on existing work (which is productive) or the team has hit a genuine bottleneck that needs to be addressed. The discomfort is a signal that the system needs improvement.
3. “We have WIP limits but we also have expedite lanes for everything”
If every urgent request bypasses the WIP limit, you do not have a WIP limit. Expedite lanes should be rare - one per week at most. If everything is urgent, nothing is.
4. “We limit WIP per person but not per team”
Per-person WIP limits miss the point. The goal is to limit team WIP so that team members are incentivized to help each other. A per-person limit of 1 with no team limit still allows the team to have 8 items in progress simultaneously with no swarming.
Measuring Success
| Metric | Target | Why It Matters |
| --- | --- | --- |
| Work in progress | At or below team limit | Confirms the limit is being respected |
| Development cycle time | Decreasing | Confirms that less WIP leads to faster delivery |
| Items completed per week | Stable or increasing | Confirms that finishing more, starting less works |
| Time items spend blocked | Decreasing | Confirms bottlenecks are being addressed |
Next Step
WIP limits expose problems. Metrics-Driven Improvement provides the framework for systematically addressing them.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
3.4.4 - Metrics-Driven Improvement
Use DORA metrics and improvement kata to drive systematic delivery improvement.
Phase 3 - Optimize | Original content combining DORA recommendations and improvement kata
Improvement without measurement is guesswork. This page combines the DORA four key metrics with the improvement kata pattern to create a systematic, repeatable approach to getting better at delivery.
The Problem with Ad Hoc Improvement
Most teams improve accidentally. Someone reads a blog post, suggests a change at standup, and the team tries it for a week before forgetting about it. This produces sporadic, unmeasurable progress that is impossible to sustain.
Metrics-driven improvement replaces this with a disciplined cycle: measure where you are, define where you want to be, run a small experiment, measure the result, and repeat. The improvement kata provides the structure. DORA metrics provide the measures.
The Four DORA Metrics
The DORA research program (now part of Google Cloud) has identified four key metrics that predict software delivery performance. These are the metrics you should track throughout your CD migration.
1. Deployment Frequency
How often your team deploys to production.
| Performance Level | Deployment Frequency |
| --- | --- |
| Elite | On-demand (multiple deploys per day) |
| High | Between once per day and once per week |
| Medium | Between once per week and once per month |
| Low | Between once per month and once every six months |
What it tells you: How comfortable your team and pipeline are with deploying. Low frequency usually indicates manual gates, fear of deployment, or large batch sizes.
How to measure: Count the number of successful deployments to production per unit of time. Automated deploys count. Hotfixes count. Rollbacks do not.
2. Lead Time for Changes
The time from a commit being pushed to trunk to that commit running in production.
| Performance Level | Lead Time |
| --- | --- |
| Elite | Less than one hour |
| High | Between one day and one week |
| Medium | Between one week and one month |
| Low | Between one month and six months |
What it tells you: How efficient your pipeline is. Long lead times indicate slow builds, manual approval steps, or infrequent deployment windows.
How to measure: Record the timestamp when a commit merges to trunk and the timestamp when that commit is running in production. The difference is lead time. Track the median, not the mean (outliers distort the mean).
3. Change Failure Rate
The percentage of deployments that cause a failure in production requiring remediation (rollback, hotfix, or patch).
| Performance Level | Change Failure Rate |
| --- | --- |
| Elite | 0-15% |
| High | 16-30% |
| Medium | 16-30% |
| Low | 46-60% |
What it tells you: How effective your testing and validation pipeline is. High failure rates indicate gaps in test coverage, insufficient pre-production validation, or overly large changes.
How to measure: Track deployments that result in a degraded service, require rollback, or need a hotfix. Divide by total deployments. A “failure” is defined by the team - typically any incident that requires immediate human intervention.
4. Mean Time to Restore (MTTR)
How long it takes to recover from a failure in production.
| Performance Level | Time to Restore |
| --- | --- |
| Elite | Less than one hour |
| High | Less than one day |
| Medium | Less than one day |
| Low | Between one week and one month |
What it tells you: How resilient your system and team are. Long recovery times indicate manual rollback processes, poor observability, or insufficient incident response practices.
How to measure: Record the timestamp when a production failure is detected and the timestamp when service is fully restored. Track the median.
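A sketch of how all four metrics could be computed from a team's own deployment and incident records; the record shapes and field names are assumptions, since the guide does not prescribe a data source:

```python
# Sketch: compute the four DORA metrics from simple event records.
# The record shapes (dicts with ISO timestamps) are illustrative assumptions.
from datetime import datetime
from statistics import median

deployments = [
    # merged_at = commit merged to trunk, deployed_at = running in production
    {"merged_at": "2025-03-03T09:00", "deployed_at": "2025-03-03T10:30", "failed": False},
    {"merged_at": "2025-03-04T11:00", "deployed_at": "2025-03-05T09:00", "failed": True},
    {"merged_at": "2025-03-06T14:00", "deployed_at": "2025-03-06T15:00", "failed": False},
]
incidents = [
    {"detected_at": "2025-03-05T09:20", "restored_at": "2025-03-05T10:05"},
]


def hours_between(start: str, end: str) -> float:
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 3600


window_weeks = 1  # the sample records above span one week
deploy_frequency = len(deployments) / window_weeks
lead_time_hours = median(hours_between(d["merged_at"], d["deployed_at"]) for d in deployments)
change_failure_rate = sum(d["failed"] for d in deployments) / len(deployments)
restore_hours = median(hours_between(i["detected_at"], i["restored_at"]) for i in incidents)

print(f"deploy frequency:    {deploy_frequency:.1f}/week")
print(f"lead time (median):  {lead_time_hours:.1f} h")
print(f"change failure rate: {change_failure_rate:.0%}")
print(f"time to restore:     {restore_hours:.1f} h (median)")
```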
The DORA Capabilities
Behind these four metrics are 24 capabilities that the DORA research has shown to drive performance. They organize into five categories. Use this as a diagnostic tool: when a metric is lagging, look at the related capabilities to identify what to improve.
Continuous Delivery Capabilities
These directly affect your pipeline and deployment practices:
- Version control for all production artifacts
- Automated deployment processes
- Continuous integration
- Trunk-based development
- Test automation
- Test data management
- Shift-left security
- Continuous delivery (the ability to deploy at any time)
Architecture Capabilities
These affect how easily your system can be changed and deployed:
- Loosely coupled architecture
- Empowered teams that can choose their own tools
- Teams that can test, deploy, and release independently
Product and Process Capabilities
These affect how work flows through the team:
- Customer feedback loops
- Value stream visibility
- Working in small batches
- Team experimentation
Lean Management Capabilities
These affect how the organization supports delivery:
- Lightweight change approval processes
- Monitoring and observability
- Proactive notification
- WIP limits
- Visual management of workflow
Cultural Capabilities
These affect the environment in which teams operate:
- Generative organizational culture (Westrum model)
- Encouraging and supporting learning
- Collaboration within and between teams
- Job satisfaction
- Transformational leadership
For a detailed breakdown, see the DORA Capabilities reference.
The Improvement Kata
The improvement kata is a four-step pattern from lean manufacturing adapted for software delivery. It provides the structure for turning DORA measurements into concrete improvements.
Step 1: Understand the Direction
Where does your CD migration need to go?
This is already defined by the phases of this migration guide. In Phase 3, your direction is: smaller batches, faster flow, and higher confidence in every deployment.
Step 2: Grasp the Current Condition
Measure your current DORA metrics. Be honest - the point is to understand reality, not to look good.
Practical approach:
- Collect two weeks of data for all four DORA metrics
- Plot the data - do not just calculate averages. Look at the distribution.
- Identify which metric is furthest from your target
- Investigate the related capabilities to understand why
Example current condition:
| Metric | Current | Target | Gap |
| --- | --- | --- | --- |
| Deployment frequency | Weekly | Daily | 5x improvement needed |
| Lead time | 3 days | < 1 day | Pipeline is slow or has manual gates |
| Change failure rate | 25% | < 15% | Test coverage or change size issue |
| MTTR | 4 hours | < 1 hour | Rollback is manual |
Step 3: Establish the Next Target Condition
Do not try to fix everything at once. Pick one metric and define a specific, measurable, time-bound target.
Good target: “Reduce lead time from 3 days to 1 day within the next 4 weeks.”
Bad target: “Improve our deployment pipeline.” (Too vague, no measure, no deadline.)
Step 4: Experiment Toward the Target
Design a small experiment that you believe will move the metric toward the target. Run it. Measure the result. Adjust.
The experiment format:
| Element | Description |
| --- | --- |
| Hypothesis | “If we [action], then [metric] will [improve/decrease] because [reason].” |
| Action | What specifically will you change? |
| Duration | How long will you run the experiment? (Typically 1-2 weeks) |
| Measure | How will you know if it worked? |
| Decision criteria | What result would cause you to keep, modify, or abandon the change? |
Example experiment:
Hypothesis: If we parallelize our integration test suite, lead time will drop from 3 days to under 2 days because 60% of lead time is spent waiting for tests to complete.
Action: Split the integration test suite into 4 parallel runners.
Duration: 2 weeks.
Measure: Median lead time for commits merged during the experiment period.
Decision criteria: Keep if lead time drops below 2 days. Modify if it drops but not enough. Abandon if it has no effect or introduces flakiness.
The Cycle Repeats
After each experiment:
- Measure the result
- Update your understanding of the current condition
- If the target is met, pick the next metric to improve
- If the target is not met, design another experiment
This creates a continuous improvement loop. Each cycle takes 1-2 weeks. Over months, the cumulative effect is dramatic.
Connecting Metrics to Action
When a metric is lagging, use this guide to identify where to focus.
Low Deployment Frequency
| Possible Cause | Investigation | Action |
| --- | --- | --- |
| Manual approval gates | Map the approval chain | Automate or eliminate non-value-adding approvals |
| Fear of deployment | Ask the team what they fear | Address the specific fear (usually testing gaps) |
| Large batch size | Measure changes per deploy | Implement small batches practices |
| Deploy process is manual | Time the deploy process | Automate the deployment pipeline |
Long Lead Time
| Possible Cause | Investigation | Action |
| --- | --- | --- |
| Slow builds | Time each pipeline stage | Optimize the slowest stage (often tests) |
| Waiting for environments | Track environment wait time | Implement self-service environments |
| Waiting for approval | Track approval wait time | Reduce approval scope or automate |
| Large changes | Measure commit size | Reduce batch size |
High Change Failure Rate
| Possible Cause | Investigation | Action |
| --- | --- | --- |
| Insufficient test coverage | Measure coverage by area | Add tests for the areas that fail most |
| Tests pass but production differs | Compare test and prod environments | Make environments more production-like |
| Large, risky changes | Measure change size | Reduce batch size, use feature flags |
| Configuration drift | Audit configuration differences | Externalize and version configuration |
Long MTTR
| Possible Cause | Investigation | Action |
| --- | --- | --- |
| Rollback is manual | Time the rollback process | Automate rollback |
| Hard to identify root cause | Review recent incidents | Improve observability and alerting |
| Hard to deploy fixes quickly | Measure fix lead time | Ensure pipeline supports rapid hotfix deployment |
| Dependencies fail in cascade | Map failure domains | Improve architecture decoupling |
Building a Metrics Dashboard
Make your DORA metrics visible to the team at all times. A dashboard on a wall monitor or a shared link is ideal.
Essential elements:
- Current values for all four DORA metrics
- Trend lines showing direction over the past 4-8 weeks
- Current target condition highlighted
- Active experiment description
Keep it simple. A spreadsheet updated weekly is better than a sophisticated dashboard that nobody maintains. The goal is visibility, not tooling sophistication.
Key Pitfalls
1. “We measure but don’t act”
Measurement without action is waste. If you collect metrics but never run experiments, you are creating overhead with no benefit. Every measurement should lead to a hypothesis. Every hypothesis should lead to an experiment.
2. “We use metrics to compare teams”
DORA metrics are for teams to improve themselves, not for management to rank teams. Using metrics for comparison creates incentives to game the numbers. Each team should own its own metrics and its own improvement targets.
3. “We try to improve all four metrics at once”
Focus on one metric at a time. Improving deployment frequency and change failure rate simultaneously often requires conflicting actions. Pick the biggest bottleneck, address it, then move to the next.
4. “We abandon experiments too quickly”
Most experiments need at least two weeks to show results. One bad day is not a reason to abandon an experiment. Set the duration up front and commit to it.
Measuring Success
| Indicator | Target | Why It Matters |
| --- | --- | --- |
| Experiments per month | 2-4 | Confirms the team is actively improving |
| Metrics trending in the right direction | Consistent improvement over 3+ months | Confirms experiments are having effect |
| Team can articulate current condition and target | Everyone on the team knows | Confirms improvement is a shared concern |
| Improvement items in backlog | Always present | Confirms improvement is treated as a deliverable |
Next Step
Metrics tell you what to improve. Retrospectives provide the team forum for deciding how to improve it.
3.4.5 - Retrospectives
Continuously improve the delivery process through structured reflection.
Phase 3 - Optimize | Adapted from Dojo Consortium
A retrospective is the team’s primary mechanism for turning observations into improvements. Without effective retrospectives, WIP limits expose problems that nobody addresses, metrics trend in the wrong direction with no response, and the CD migration stalls.
Why Retrospectives Matter for CD Migration
Every practice in this guide - trunk-based development, small batches, WIP limits, metrics-driven improvement - generates signals about what is working and what is not. Retrospectives are where the team processes those signals and decides what to change.
Teams that skip retrospectives or treat them as a checkbox exercise consistently stall at whatever maturity level they first reach. Teams that run effective retrospectives continuously improve, week after week, month after month.
The Five-Part Structure
An effective retrospective follows a structured format that prevents it from devolving into a venting session or a status meeting. This five-part structure ensures the team moves from observation to action.
Part 1: Review the Mission (5 minutes)
Start by reminding the team of the larger goal. In the context of a CD migration, this might be:
- “Our mission this quarter is to deploy to production at least once per day.”
- “We are working toward eliminating manual gates in our pipeline.”
- “Our goal is to reduce lead time from 3 days to under 1 day.”
This grounding prevents the retrospective from focusing on minor irritations and keeps the conversation aligned with what matters.
Part 2: Review the KPIs (10 minutes)
Present the team’s current metrics. For a CD migration, these are typically the DORA metrics plus any team-specific measures from Metrics-Driven Improvement.
| Metric | Last Period | This Period | Trend |
| --- | --- | --- | --- |
| Deployment frequency | 3/week | 4/week | Improving |
| Lead time (median) | 2.5 days | 2.1 days | Improving |
| Change failure rate | 22% | 18% | Improving |
| MTTR | 3 hours | 3.5 hours | Declining |
| WIP (average) | 8 items | 6 items | Improving |
Do not skip this step. Without data, the retrospective becomes a subjective debate where the loudest voice wins. With data, the conversation focuses on what the numbers show and what to do about them.
Part 3: Review Experiments (10 minutes)
Review the outcomes of any experiments the team ran since the last retrospective.
For each experiment:
- What was the hypothesis? Remind the team what you were testing.
- What happened? Present the data.
- What did you learn? Even failed experiments teach you something.
- What is the decision? Keep, modify, or abandon.
Example:
Experiment: Parallelize the integration test suite to reduce lead time.
Hypothesis: Lead time would drop from 2.5 days to under 2 days.
Result: Lead time dropped to 2.1 days. The parallelization worked, but environment setup time is now the bottleneck.
Decision: Keep the parallelization. New experiment: investigate self-service test environments.
Part 4: Check Goals (10 minutes)
Review any improvement goals or action items from the previous retrospective.
- Completed: Acknowledge and celebrate. This is important - it reinforces that improvement work matters.
- In progress: Check for blockers. Does the team need to adjust the approach?
- Not started: Why not? Was it deprioritized, blocked, or forgotten? If improvement work is consistently not started, the team is not treating improvement as a deliverable (see below).
Part 5: Open Conversation (25 minutes)
This is the core of the retrospective. The team discusses:
- What is working well that we should keep doing?
- What is not working that we should change?
- What new problems or opportunities have we noticed?
Facilitation techniques for this section:
| Technique | How It Works | Best For |
| --- | --- | --- |
| Start/Stop/Continue | Each person writes items in three categories | Quick, structured, works with any team |
| 4Ls (Liked, Learned, Lacked, Longed For) | Broader categories that capture emotional responses | Teams that need to process frustration or celebrate wins |
| Timeline | Plot events on a timeline and discuss turning points | After a particularly eventful sprint or incident |
| Dot voting | Everyone gets 3 votes to prioritize discussion topics | When there are many items and limited time |
From Conversation to Commitment
The open conversation must produce concrete action items. Vague commitments like “we should communicate better” are worthless. Good action items are:
- Specific: “Add a Slack notification when the build breaks” (not “improve communication”)
- Owned: “Alex will set this up by Wednesday” (not “someone should do this”)
- Measurable: “We will know this worked if build break response time drops below 10 minutes”
- Time-bound: “We will review the result at the next retrospective”
Limit action items to 1-3 per retrospective. More than three means nothing gets done. One well-executed improvement is worth more than five abandoned ones.
Psychological Safety Is a Prerequisite
A retrospective only works if team members feel safe to speak honestly about what is not working. Without psychological safety, retrospectives produce sanitized, non-actionable discussion.
Signs of Low Psychological Safety
- Only senior team members speak
- Nobody mentions problems - everything is “fine”
- Issues that everyone knows about are never raised
- Team members vent privately after the retrospective instead of during it
- Action items are always about tools or processes, never about behaviors
Building Psychological Safety
| Practice | Why It Helps |
| --- | --- |
| Leader speaks last | Prevents the leader’s opinion from anchoring the discussion |
| Anonymous input | Use sticky notes or digital tools where input is anonymous initially |
| Blame-free language | “The deploy failed” not “You broke the deploy” |
| Follow through on raised issues | Nothing destroys safety faster than raising a concern and having it ignored |
| Acknowledge mistakes openly | Leaders who admit their own mistakes make it safe for others to do the same |
| Separate retrospective from performance review | If retro content affects reviews, people will not be honest |
Treat Improvement as a Deliverable
The most common failure mode for retrospectives is producing action items that never get done. This happens when improvement work is treated as something to do “when we have time” - which means never.
Make Improvement Visible
- Add improvement items to the same board as feature work
- Include improvement items in WIP limits
- Track improvement items through the same workflow as any other deliverable
Allocate Capacity
Reserve a percentage of team capacity for improvement work. Common allocations:
| Allocation | Approach |
| --- | --- |
| 20% continuous | One day per week (or equivalent) dedicated to improvement, tooling, and tech debt |
| Dedicated improvement sprint | Every 4th sprint is entirely improvement-focused |
| Improvement as first pull | When someone finishes work and the WIP limit allows, the first option is an improvement item |
The specific allocation matters less than having one. A team that explicitly budgets 10% for improvement will improve more than a team that aspires to 20% but never protects the time.
Retrospective Cadence
| Cadence | Best For | Caution |
| --- | --- | --- |
| Weekly | Teams in active CD migration, teams working through major changes | Can feel like too many meetings if not well-facilitated |
| Bi-weekly | Teams in steady state with ongoing improvement | Most common cadence |
| After incidents | Any team | Incident retrospectives (postmortems) are separate from regular retrospectives |
| Monthly | Mature teams with well-established improvement habits | Too infrequent for teams early in their migration |
During active phases of a CD migration (Phases 1-3), weekly retrospectives are recommended. Once the team reaches Phase 4, bi-weekly is usually sufficient.
Running Your First CD Migration Retrospective
If your team has not been running effective retrospectives, start here:
Before the Retrospective
- Collect your DORA metrics for the past two weeks
- Review any action items from the previous retrospective (if applicable)
- Prepare a shared document or board with the five-part structure
During the Retrospective (60 minutes)
- Review mission (5 min): State your CD migration goal for this phase
- Review KPIs (10 min): Present the DORA metrics. Ask: “What do you notice?”
- Review experiments (10 min): Discuss any experiments that were run
- Check goals (10 min): Review action items from last time
- Open conversation (25 min): Use Start/Stop/Continue for the first time - it is the simplest format
After the Retrospective
- Publish the action items where the team will see them daily
- Assign owners and due dates
- Add improvement items to the team board
- Schedule the next retrospective
Key Pitfalls
1. “Our retrospectives always produce the same complaints”
If the same issues surface repeatedly, the team is not executing on its action items. Check whether improvement work is being prioritized alongside feature work. If it is not, no amount of retrospective technique will help.
2. “People don’t want to attend because nothing changes”
This is a symptom of the same problem - action items are not executed. The fix is to start small: commit to one action item per retrospective, execute it completely, and demonstrate the result at the next retrospective. Success builds momentum.
3. “The retrospective turns into a blame session”
The facilitator must enforce blame-free language. Redirect “You did X wrong” to “When X happened, the impact was Y. How can we prevent Y?” If blame is persistent, the team has a psychological safety problem that needs to be addressed separately.
4. “We don’t have time for retrospectives”
A team that does not have time to improve will never improve. A 60-minute retrospective that produces one executed improvement is the highest-leverage hour of the entire sprint.
Measuring Success
| Indicator | Target | Why It Matters |
| --- | --- | --- |
| Retrospective attendance | 100% of team | Confirms the team values the practice |
| Action items completed | > 80% completion rate | Confirms improvement is treated as a deliverable |
| DORA metrics trend | Improving quarter over quarter | Confirms retrospectives lead to real improvement |
| Team engagement | Voluntary contributions increasing | Confirms psychological safety is present |
Next Step
With metrics-driven improvement and effective retrospectives, you have the engine for continuous improvement. The final optimization step is Architecture Decoupling - ensuring your system’s architecture does not prevent you from deploying independently.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
3.4.6 - Architecture Decoupling
Enable independent deployment of components by decoupling architecture boundaries.
Phase 3 - Optimize | Original content based on Dojo Consortium delivery journey patterns
You cannot deploy independently if your architecture requires coordinated releases. This page describes the three architecture states teams encounter on the journey to continuous deployment and provides practical strategies for moving from entangled to loosely coupled.
Why Architecture Matters for CD
Every practice in this guide - small batches, feature flags, WIP limits - assumes that your team can deploy its changes independently. But if your application is a monolith where changing one module requires retesting everything, or a set of microservices with tightly coupled APIs, independent deployment is impossible regardless of how good your practices are.
Architecture is either an enabler or a blocker for continuous deployment. There is no neutral.
Three Architecture States
The Delivery System Improvement Journey describes three states that teams move through. Most teams start entangled. The goal is to reach loosely coupled.
State 1: Entangled
In an entangled architecture, everything is connected to everything. Changes in one area routinely break other areas. Teams cannot deploy independently.
Characteristics:
- Shared database schemas with no ownership boundaries
- Circular dependencies between modules or services
- Deploying one service requires deploying three others at the same time
- Integration testing requires the entire system to be running
- A single team’s change can block every other team’s release
- “Big bang” releases on a fixed schedule
Impact on delivery:
| Metric | Typical State |
| --- | --- |
| Deployment frequency | Monthly or quarterly (because coordinating releases is hard) |
| Lead time | Weeks to months (because changes wait for the next release train) |
| Change failure rate | High (because big releases mean big risk) |
| MTTR | Long (because failures cascade across boundaries) |
How you got here: Entanglement is the natural result of building quickly without deliberate architectural boundaries. It is not a failure - it is a stage that almost every system passes through.
State 2: Tightly Coupled
In a tightly coupled architecture, there are identifiable boundaries between components, but those boundaries are leaky. Teams have some independence, but coordination is still required for many changes.
Characteristics:
- Services exist but share a database or use synchronous point-to-point calls
- API contracts exist but are not versioned - breaking changes require simultaneous updates
- Teams can deploy some changes independently, but cross-cutting changes require coordination
- Integration testing requires multiple services but not the entire system
- Release trains still exist but are smaller and more frequent
Impact on delivery:
| Metric | Typical State |
| --- | --- |
| Deployment frequency | Weekly to bi-weekly |
| Lead time | Days to a week |
| Change failure rate | Moderate (improving but still affected by coupling) |
| MTTR | Hours (failures are more isolated but still cascade sometimes) |
State 3: Loosely Coupled
In a loosely coupled architecture, components communicate through well-defined interfaces, own their own data, and can be deployed independently without coordinating with other teams.
Characteristics:
- Each service owns its own data store - no shared databases
- APIs are versioned; consumers and producers can be updated independently
- Asynchronous communication (events, queues) is used where possible
- Each team can deploy without coordinating with any other team
- Services are designed to degrade gracefully if a dependency is unavailable
- No release trains - each team deploys when ready
Impact on delivery:
| Metric | Typical State |
| --- | --- |
| Deployment frequency | On-demand (multiple times per day) |
| Lead time | Hours |
| Change failure rate | Low (small, isolated changes) |
| MTTR | Minutes (failures are contained within service boundaries) |
Moving from Entangled to Tightly Coupled
This is the first and most difficult transition. It requires establishing boundaries where none existed before.
Strategy 1: Identify Natural Seams
Look for places where the system already has natural boundaries, even if they are not enforced:
- Different business domains: Orders, payments, inventory, and user accounts are different domains even if they live in the same codebase.
- Different rates of change: Code that changes weekly and code that changes yearly should not be in the same deployment unit.
- Different scaling needs: Components with different load profiles benefit from separate deployment.
- Different team ownership: If different teams work on different parts of the codebase, those parts are candidates for separation.
Strategy 2: Strangler Fig Pattern
Instead of rewriting the system, incrementally extract components from the monolith.
Step 1: Route all traffic through a facade/proxy
Step 2: Build the new component alongside the old
Step 3: Route a small percentage of traffic to the new component
Step 4: Validate correctness and performance
Step 5: Route all traffic to the new component
Step 6: Remove the old code
Key rule: The strangler fig pattern must be done incrementally. If you try to extract everything at once, you are doing a rewrite, not a strangler fig.
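As an illustration of step 3, a thin facade can route a small, configurable share of traffic to the extracted component while everything else keeps hitting the monolith (handler names and the percentage source are assumptions):

```python
# Sketch of the routing step in a strangler fig migration: a thin facade
# decides, per request, whether to call the old monolith code path or the
# newly extracted component. Handler names are illustrative.
import hashlib

NEW_COMPONENT_TRAFFIC_PERCENT = 5  # raise gradually as confidence grows


def handle_with_monolith(order_id: str) -> str:
    return f"monolith handled order {order_id}"


def handle_with_new_service(order_id: str) -> str:
    return f"orders-service handled order {order_id}"


def handle_order(order_id: str) -> str:
    # Stable bucketing: the same order always takes the same path, which
    # makes results comparable while both implementations are live.
    bucket = int(hashlib.sha256(order_id.encode()).hexdigest(), 16) % 100
    if bucket < NEW_COMPONENT_TRAFFIC_PERCENT:
        return handle_with_new_service(order_id)
    return handle_with_monolith(order_id)


print(handle_order("order-1234"))
```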
Strategy 3: Define Ownership Boundaries
Assign clear ownership of each module or component to a single team. Ownership means:
- The owning team decides the API contract
- The owning team deploys the component
- Other teams consume the API, not the internal implementation
- Changes to the API contract require agreement from consumers (but not simultaneous deployment)
What to Avoid
- The “big rewrite”: Rewriting a monolith from scratch almost always fails. Use the strangler fig pattern instead.
- Premature microservices: Do not split into microservices until you have clear domain boundaries and team ownership. Microservices with unclear boundaries are a distributed monolith - the worst of both worlds.
- Shared databases across services: This is the most common coupling mechanism. If two services share a database, they cannot be deployed independently because a schema change in one service can break the other.
Moving from Tightly Coupled to Loosely Coupled
This transition is about hardening the boundaries that were established in the previous step.
Strategy 1: Eliminate Shared Data Stores
If two services share a database, one of three things needs to happen:
- One service owns the data, the other calls its API. The dependent service no longer accesses the database directly.
- The data is duplicated. Each service maintains its own copy, synchronized via events.
- The shared data becomes a dedicated data service. Both services consume from a service that owns the data.
BEFORE (shared database):
Service A → [Shared DB] ← Service B
AFTER (option 1 - API ownership):
Service A → [DB A]
Service B → Service A API → [DB A]
AFTER (option 2 - event-driven duplication):
Service A → [DB A] → Events → Service B → [DB B]
AFTER (option 3 - data service):
Service A → Data Service → [DB]
Service B → Data Service → [DB]
Strategy 2: Version Your APIs
API versioning allows consumers and producers to evolve independently.
Rules for API versioning (a versioned-route sketch follows this list):
- Never make a breaking change without a new version. Adding fields is non-breaking. Removing fields is breaking. Changing field types is breaking.
- Support at least two versions simultaneously. This gives consumers time to migrate.
- Deprecate old versions with a timeline. “Version 1 will be removed on date X.”
- Use consumer-driven contract tests to verify compatibility. See Contract Testing.
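One way the "support two versions" rule might look in practice, serving v1 and v2 side by side so consumers migrate on their own schedule (the routes and response shapes are illustrative, not a prescribed API):

```python
# Sketch: two versions of the same endpoint served side by side.
def get_order_v1(order_id: str) -> dict:
    # Original contract: "total" is a float in dollars.
    return {"id": order_id, "total": 42.5}


def get_order_v2(order_id: str) -> dict:
    # Breaking change (field renamed and retyped), so it ships as a new version.
    return {"id": order_id, "total_cents": 4250}


ROUTES = {
    "/api/v1/orders/{id}": get_order_v1,  # still served; deprecation date announced
    "/api/v2/orders/{id}": get_order_v2,  # current version
}
```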
Strategy 3: Prefer Asynchronous Communication
Synchronous calls (HTTP, gRPC) create temporal coupling: if the downstream service is slow or unavailable, the upstream service is also affected.
| Communication Style | Coupling | When to Use |
| --- | --- | --- |
| Synchronous (HTTP/gRPC) | Temporal + behavioral | When the caller needs an immediate response |
| Asynchronous (events/queues) | Behavioral only | When the caller does not need an immediate response |
| Event-driven (publish/subscribe) | Minimal | When the producer does not need to know about consumers |
Prefer asynchronous communication wherever the business requirements allow it. Not every interaction needs to be synchronous.
Strategy 4: Design for Failure
In a loosely coupled system, dependencies will sometimes be unavailable. Design for this (a circuit-breaker sketch follows this list):
- Circuit breakers: Stop calling a failing dependency after N failures. Return a degraded response instead.
- Timeouts: Set aggressive timeouts on all external calls. A 30-second timeout on a service that should respond in 100ms is not a timeout - it is a hang.
- Bulkheads: Isolate failures so that one failing dependency does not consume all resources.
- Graceful degradation: Define what the user experience should be when a dependency is down. “Recommendations unavailable” is better than a 500 error.
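A minimal circuit-breaker sketch under assumed thresholds (3 consecutive failures, a 30-second cool-down) showing how a failing dependency can be short-circuited into a degraded response:

```python
# Minimal circuit breaker: after N consecutive failures, stop calling the
# dependency for a cool-down period and return a degraded response instead.
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.consecutive_failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, fallback=None):
        # While the breaker is open, short-circuit and degrade gracefully.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                return fallback
            self.opened_at = None  # cool-down elapsed: try the dependency again

        try:
            result = fn(*args)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback
        self.consecutive_failures = 0
        return result


recommendations_breaker = CircuitBreaker()


def fetch_recommendations(user_id: str) -> list[str]:
    raise TimeoutError("recommendations service is down")  # simulate an outage


# "Recommendations unavailable" (an empty list) is better than a 500 error.
print(recommendations_breaker.call(fetch_recommendations, "user-1", fallback=[]))
```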
Practical Steps for Architecture Decoupling
Month 1: Map Dependencies
Before changing anything, understand what you have:
- Draw a dependency graph. Which components depend on which? Where are the shared databases?
- Identify deployment coupling. Which components must be deployed together? Why?
- Identify the highest-impact coupling. Which coupling most frequently blocks independent deployment?
Month 2-3: Establish the First Boundary
Pick one component to decouple. Choose the one with the highest impact and lowest risk:
- Apply the strangler fig pattern to extract it
- Define a clear API contract
- Move its data to its own data store
- Deploy it independently
Month 4+: Repeat
Take the next highest-impact coupling and address it. Each decoupling makes the next one easier because the team learns the patterns and the remaining system is simpler.
Key Pitfalls
1. “We need to rewrite everything before we can deploy independently”
No. Decoupling is incremental. Extract one component, deploy it independently, prove the pattern works, then continue. A partial decoupling that enables one team to deploy independently is infinitely more valuable than a planned rewrite that never finishes.
2. “We split into microservices but our lead time got worse”
Microservices add operational complexity (more services to deploy, monitor, and debug). If you split without investing in deployment automation, observability, and team autonomy, you will get worse, not better. Microservices are a tool for organizational scaling, not a silver bullet for delivery speed.
3. “Teams keep adding new dependencies that recouple the system”
Architecture decoupling requires governance. Establish architectural principles (e.g., “no shared databases”) and enforce them through automated checks (e.g., dependency analysis in CI) and architecture reviews for cross-boundary changes.
4. “We can’t afford the time to decouple”
You cannot afford not to. Every week spent doing coordinated releases is a week of delivery capacity lost to coordination overhead. The investment in decoupling pays for itself quickly through increased deployment frequency and reduced coordination cost.
Measuring Success
| Metric | Target | Why It Matters |
| --- | --- | --- |
| Teams that can deploy independently | Increasing | The primary measure of decoupling |
| Coordinated releases per quarter | Decreasing toward zero | Confirms coupling is being eliminated |
| Deployment frequency per team | Increasing independently | Confirms teams are not blocked by each other |
| Cross-team dependencies per feature | Decreasing | Confirms architecture supports independent work |
Next Step
With optimized flow, small batches, metrics-driven improvement, and a decoupled architecture, your team is ready for the final phase. Continue to Phase 4: Deliver on Demand.
3.5 - Phase 4: Deliver on Demand
The capability to deploy any change to production at any time, using the delivery strategy that fits your context.
Key question: “Can we deliver any change to production when the business needs it?”
This is the destination: you can deploy any change that passes the pipeline to production
whenever you choose. Some teams will auto-deploy every commit (continuous deployment). Others
will deploy on demand when the business is ready. Both are valid - the capability is what
matters, not the trigger.
What You’ll Do
- Deploy on demand - Remove the last manual gates so any green build can reach production
- Use progressive rollout - Canary, blue-green, and percentage-based deployments
- Explore agentic CD - AI-assisted continuous delivery patterns
- Learn from experience reports - How other teams made the journey
Continuous Delivery vs. Continuous Deployment
These terms are often confused. The distinction matters for this phase:
- Continuous delivery means every commit that passes the pipeline could be deployed to
production at any time. The capability exists. A human or business process decides when.
- Continuous deployment means every commit that passes the pipeline is deployed to
production automatically. No human decision is involved.
Continuous delivery is the goal of this migration guide. Continuous deployment is one delivery
strategy that works well for certain contexts - SaaS products, internal tools, services behind
feature flags. It is not a higher level of maturity. A team that deploys on demand with a
one-click deploy is just as capable as a team that auto-deploys every commit.
Why This Phase Matters
When your foundations are solid, your pipeline is reliable, and your batch sizes are small,
deploying any change becomes low-risk. The remaining barriers are organizational, not
technical: approval processes, change windows, release coordination. This phase addresses those
barriers so the team has the option to deploy whenever the business needs it.
Signs You’ve Arrived
- Any commit that passes the pipeline can reach production within minutes
- The team deploys frequently (daily or more) with no drama
- Mean time to recovery is measured in minutes
- The team has confidence that any deployment can be safely rolled back
- New team members can deploy on their first day
- The deployment strategy (on-demand or automatic) is a team choice, not a constraint
3.5.1 - Deploy on Demand
Remove the last manual gates and deploy every change that passes the pipeline.
Phase 4 - Deliver on Demand | Original content
Deploy on demand means that any change which passes the full automated pipeline can reach production without waiting for a human to press a button, open a ticket, or schedule a window. This page covers the prerequisites, the transition from continuous delivery to continuous deployment, and how to address the organizational concerns that are the real barriers.
Continuous Delivery vs. Continuous Deployment
These two terms are often confused. The distinction matters:
- Continuous Delivery: Every commit that passes the pipeline could be deployed to production. A human decides when to deploy.
- Continuous Deployment: Every commit that passes the pipeline is deployed to production. No human decision is required.
If you have completed Phases 1-3 of this migration, you have continuous delivery. This page is about removing that last manual decision and moving to continuous deployment.
Why Remove the Last Gate?
The manual deployment decision feels safe. It gives someone a chance to “eyeball” the change before it goes to production. In practice, it does the opposite.
The Problems with Manual Gates
| Problem | Why It Happens | Impact |
| --- | --- | --- |
| Batching | If deploys are manual, teams batch changes to reduce the number of deploy events | Larger batches increase risk and make rollback harder |
| Delay | Changes wait for someone to approve, which may take hours or days | Longer lead time, delayed feedback |
| False confidence | The approver cannot meaningfully review what the automated pipeline already tested | The gate provides the illusion of safety without actual safety |
| Bottleneck | One person or team becomes the deploy gatekeeper | Creates a single point of failure for the entire delivery flow |
| Deploy fear | Infrequent deploys mean each deploy is higher stakes | Teams become more cautious, batches get larger, risk increases |
The Paradox of Manual Safety
The more you rely on manual deployment gates, the less safe your deployments become. This is because manual gates lead to batching, batching increases risk, and increased risk justifies more manual gates. It is a vicious cycle.
Continuous deployment breaks this cycle. Small, frequent, automated deployments are individually low-risk. If one fails, the blast radius is small and recovery is fast.
Prerequisites for Deploy on Demand
Before removing manual gates, verify that these conditions are met. Each one is covered in earlier phases of this migration.
Non-Negotiable Prerequisites
| Prerequisite | What It Means | Where to Build It |
|--------------|---------------|-------------------|
| Comprehensive automated tests | The test suite catches real defects, not just trivial cases | Testing Fundamentals |
| Fast, reliable pipeline | The pipeline completes in under 15 minutes and rarely fails for non-code reasons | Deterministic Pipeline |
| Automated rollback | You can roll back a bad deployment in minutes without manual intervention | Rollback |
| Feature flags | Incomplete features are hidden from users via flags, not deployment timing | Feature Flags |
| Small batch sizes | Each deployment contains 1-3 small changes, not dozens | Small Batches |
| Production-like environments | Test environments match production closely enough that test results are trustworthy | Production-Like Environments |
| Observability | You can detect production issues within minutes through monitoring and alerting | Metrics-Driven Improvement |
Assessment: Are You Ready?
Answer these questions honestly:
- When was the last time your pipeline caught a real bug? If the answer is “I don’t remember,” your test suite may not be trustworthy enough.
- How long does a rollback take? If the answer is more than 15 minutes, automate it first.
- Do deploys ever fail for non-code reasons? (Environment issues, credential problems, network flakiness.) If yes, stabilize your pipeline first.
- Does the team trust the pipeline? If team members regularly say “let me check one more thing before we deploy,” trust is not there yet. Build it through retrospectives and transparent metrics.
The Transition: Three Approaches
Approach 1: Shadow Mode
Run continuous deployment alongside manual deployment. Every change that passes the pipeline is automatically deployed to a shadow production environment (or a canary group). A human still approves the “real” production deployment.
Duration: 2-4 weeks.
What you learn: How often the automated deployment would have been correct. If the answer is “every time” (or close to it), the manual gate is not adding value.
Transition: Once the team sees that the shadow deployments are consistently safe, remove the manual gate.
Approach 2: Opt-In per Team
Allow individual teams to adopt continuous deployment while others continue with manual gates. This works well in organizations with multiple teams at different maturity levels.
Duration: Ongoing. Teams opt in when they are ready.
What you learn: Which teams are ready and which need more foundation work. Early adopters demonstrate the pattern for the rest of the organization.
Transition: As more teams succeed, continuous deployment becomes the default. Remaining teams are supported in reaching readiness.
Approach 3: Direct Switchover
Remove the manual gate for all teams at once. This is appropriate when the organization has high confidence in its pipeline and all teams have completed Phases 1-3.
Duration: Immediate.
What you learn: Quickly reveals any hidden dependencies on the manual gate (e.g., deploy coordination between teams, configuration changes that ride along with deployments).
Transition: Be prepared to temporarily revert if unforeseen issues arise. Have a clear rollback plan for the process change itself.
Addressing Organizational Concerns
The technical prerequisites are usually met before the organizational ones. These are the conversations you will need to have.
“What about change management / ITIL?”
Change management frameworks like ITIL define a “standard change” category: a pre-approved, low-risk, well-understood change that does not require a Change Advisory Board (CAB) review. Continuous deployment changes qualify as standard changes because they are:
- Small (one to a few commits)
- Automated (same pipeline every time)
- Reversible (automated rollback)
- Well-tested (comprehensive automated tests)
Work with your change management team to classify pipeline-passing deployments as standard changes. This preserves the governance framework while removing the bottleneck.
“What about compliance and audit?”
Continuous deployment does not eliminate audit trails - it strengthens them. Every deployment is:
- Traceable: Tied to a specific commit, which is tied to a specific story or ticket
- Reproducible: The same pipeline produces the same result every time
- Recorded: Pipeline logs capture every test that passed, every approval that was automated
- Reversible: Rollback history shows when and why a deployment was reverted
Provide auditors with access to pipeline logs, deployment history, and the automated test suite. This is a more complete audit trail than a manual approval signature.
“What about database migrations?”
Database migrations require special care in continuous deployment because they cannot be rolled back as easily as code changes.
Rules for database migrations in CD:
- Migrations must be backward-compatible. The previous version of the code must work with the new schema.
- Use the expand/contract pattern. First deploy the new column/table (expand). Then deploy the code that uses it. Then remove the old column/table (contract). Each step is a separate deployment (a sketch follows this list).
- Never drop a column in the same deployment that stops using it. There is always a window where both old and new code run simultaneously.
- Test migrations in production-like environments before they reach production.
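To make the expand/contract rule concrete, here is a minimal sketch of the three-step sequence. The table and column names are hypothetical and the SQL is illustrative - adapt it to your own migration tool.

```python
# Hypothetical expand/contract sequence for replacing users.email with users.contact_email.
# Each numbered step ships as its own deployment; at every point the running code
# works with the schema that is live in production.

STEP_1_EXPAND = """
ALTER TABLE users ADD COLUMN contact_email TEXT;   -- new column, nullable
UPDATE users SET contact_email = email;            -- backfill existing rows
"""

# Step 2 is a code-only deployment: the application writes to both columns
# and reads from contact_email. No schema change ships with it.

STEP_3_CONTRACT = """
ALTER TABLE users DROP COLUMN email;               -- only after no deployed code reads it
"""
```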
“What if we deploy a breaking change?”
This is why you have automated rollback and observability. The sequence is:
- Deployment happens automatically
- Monitoring detects an issue (error rate spike, latency increase, health check failure)
- Automated rollback triggers (or on-call engineer triggers manual rollback)
- The team investigates and fixes the issue
- The fix goes through the pipeline and deploys automatically
The key insight: this sequence takes minutes with continuous deployment. With manual deployment on a weekly schedule, the same breaking change would take days to detect and fix.
After the Transition
What Changes for the Team
| Before | After |
|--------|-------|
| “Are we deploying today?” | Deploys happen automatically, all the time |
| “Who’s doing the deploy?” | Nobody - the pipeline does it |
| “Can I get this into the next release?” | Every merge to trunk is the next release |
| “We need to coordinate the deploy with team X” | Teams deploy independently |
| “Let’s wait for the deploy window” | There are no deploy windows |
What Stays the Same
- Code review still happens (before merge to trunk)
- Automated tests still run (in the pipeline)
- Feature flags still control feature visibility (decoupling deploy from release)
- Monitoring still catches issues (but now recovery is faster)
- The team still owns its deployments (but the manual step is gone)
The First Week
The first week of continuous deployment will feel uncomfortable. This is normal. The team will instinctively want to “check” deployments that happen automatically. Resist the urge to add manual checks back. Instead:
- Watch the monitoring dashboards more closely than usual
- Have the team discuss each automatic deployment in standup for the first week
- Celebrate the first deployment that goes out without anyone noticing - that is the goal
Key Pitfalls
1. “We adopted continuous deployment but kept the approval step ‘just in case’”
If the approval step exists, it will be used, and you have not actually adopted continuous deployment. Remove the gate completely. If something goes wrong, use rollback - do not use a pre-deployment gate.
2. “Our deploy cadence didn’t actually increase”
Continuous deployment only increases deploy frequency if the team is integrating to trunk frequently. If the team still merges weekly, they will deploy weekly - automatically, but still weekly. Revisit Trunk-Based Development and Small Batches.
3. “We have continuous deployment for the application but not the database/infrastructure”
Partial continuous deployment creates a split experience: application changes flow freely but infrastructure changes still require manual coordination. Extend the pipeline to cover infrastructure as code, database migrations, and configuration changes.
Measuring Success
| Metric | Target | Why It Matters |
|--------|--------|----------------|
| Deployment frequency | Multiple per day | Confirms the pipeline is deploying every change |
| Lead time | < 1 hour from commit to production | Confirms no manual gates are adding delay |
| Manual interventions per deploy | Zero | Confirms the process is fully automated |
| Change failure rate | Stable or improving | Confirms automation is not introducing new failures |
| MTTR | < 15 minutes | Confirms automated rollback is working |
Next Step
Continuous deployment deploys every change, but not every change needs to go to every user at once. Progressive Rollout strategies let you control who sees a change and how quickly it spreads.
3.5.2 - Progressive Rollout
Use canary, blue-green, and percentage-based deployments to reduce deployment risk.
Phase 4 - Deliver on Demand | Original content
Progressive rollout strategies let you deploy to production without deploying to all users simultaneously. By exposing changes to a small group first and expanding gradually, you catch problems before they affect your entire user base. This page covers the three major strategies, when to use each, and how to implement automated rollback.
Why Progressive Rollout?
Even with comprehensive tests, production-like environments, and small batch sizes, some issues only surface under real production traffic. Progressive rollout is the final safety layer: it limits the blast radius of any deployment by exposing the change to a small audience first.
This is not a replacement for testing. It is an addition. Your automated tests should catch the vast majority of issues. Progressive rollout catches the rest - the issues that depend on real user behavior, real data volumes, or real infrastructure conditions that cannot be fully replicated in test environments.
The Three Strategies
Strategy 1: Canary Deployment
A canary deployment routes a small percentage of production traffic to the new version while the majority continues to hit the old version. If the canary shows no problems, traffic is gradually shifted.
```
                   ┌─────────────────┐
              5%   │   New Version   │ ← Canary
           ┌──────►│      (v2)       │
           │       └─────────────────┘
Traffic ───┤
           │       ┌─────────────────┐
           └──────►│   Old Version   │ ← Stable
              95%  │      (v1)       │
                   └─────────────────┘
```
How it works:
- Deploy the new version alongside the old version
- Route 1-5% of traffic to the new version
- Compare key metrics (error rate, latency, business metrics) between canary and stable
- If metrics are healthy, increase traffic to 25%, 50%, 100%
- If metrics degrade, route all traffic back to the old version
When to use canary:
- Changes that affect request handling (API changes, performance optimizations)
- Changes where you want to compare metrics between old and new versions
- Services with high traffic volume (you need enough canary traffic for statistical significance)
When canary is not ideal:
- Changes that affect batch processing or background jobs (no “traffic” to route)
- Very low traffic services (the canary may not get enough traffic to detect issues)
- Database schema changes (both versions must work with the same schema)
Implementation options:
| Infrastructure | How to Route Traffic |
|----------------|----------------------|
| Kubernetes + service mesh (Istio, Linkerd) | Weighted routing rules in VirtualService |
| Load balancer (ALB, NGINX) | Weighted target groups |
| CDN (CloudFront, Fastly) | Origin routing rules |
| Application-level | Feature flag with percentage rollout |
Strategy 2: Blue-Green Deployment
Blue-green deployment maintains two identical production environments. At any time, one (blue) serves live traffic and the other (green) is idle or staging.
```
BEFORE:
  Traffic ──────► [Blue - v1]    (ACTIVE)
                  [Green]        (IDLE)

DEPLOY:
  Traffic ──────► [Blue - v1]    (ACTIVE)
                  [Green - v2]   (DEPLOYING / SMOKE TESTING)

SWITCH:
  Traffic ──────► [Green - v2]   (ACTIVE)
                  [Blue - v1]    (STANDBY / ROLLBACK TARGET)
```
How it works:
- Deploy the new version to the idle environment (green)
- Run smoke tests against green to verify basic functionality
- Switch the router/load balancer to point all traffic at green
- Keep blue running as an instant rollback target
- After a stability period, repurpose blue for the next deployment
When to use blue-green:
- You need instant, complete rollback (switch the router back)
- You want to test the deployment in a full production environment before routing traffic
- Your infrastructure supports running two parallel environments cost-effectively
When blue-green is not ideal:
- Stateful applications where both environments share mutable state
- Database migrations (the new version’s schema must work for both environments during transition)
- Cost-sensitive environments (maintaining two full production environments doubles infrastructure cost)
Rollback speed: Seconds. Switching the router back is the fastest rollback mechanism available.
Strategy 3: Percentage-Based Rollout
Percentage-based rollout gradually increases the number of users who see the new version. Unlike canary (which is traffic-based), percentage rollout is typically user-based - a specific user always sees the same version during the rollout period.
```
Hour 0:    1% of users → v2,   99% → v1
Hour 2:    5% of users → v2,   95% → v1
Hour 8:   25% of users → v2,   75% → v1
Day 2:    50% of users → v2,   50% → v1
Day 3:   100% of users → v2
```
How it works:
- Enable the new version for a small percentage of users (using feature flags or infrastructure routing)
- Monitor metrics for the affected group
- Gradually increase the percentage over hours or days
- At any point, reduce the percentage back to 0% if issues are detected
When to use percentage rollout:
- User-facing feature changes where you want consistent user experience (a user always sees v1 or v2, not a random mix)
- Changes that benefit from A/B testing data (compare user behavior between groups)
- Long-running rollouts where you want to collect business metrics before full exposure
When percentage rollout is not ideal:
- Backend infrastructure changes with no user-visible impact
- Changes that affect all users equally (e.g., API response format changes)
Implementation: Percentage rollout is typically implemented through Feature Flags (Level 2 or Level 3), using the user ID as the hash key to ensure consistent assignment.
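A minimal sketch of consistent percentage assignment, assuming a flag check that hashes the user ID; the function and flag names are illustrative, not taken from any specific feature-flag library.

```python
import hashlib

def in_rollout(user_id: str, flag_name: str, percentage: float) -> bool:
    """Deterministically assign a user to the rollout group.

    The same user_id always hashes to the same bucket, so a user who sees v2
    keeps seeing v2 as the percentage grows.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000    # bucket in 0..9999
    return bucket < percentage * 100         # e.g. 5.0 (%) -> buckets 0..499

# Usage: gate the new code path behind the flag.
if in_rollout(user_id="user-42", flag_name="new-checkout", percentage=5.0):
    pass  # serve v2
else:
    pass  # serve v1
```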
Choosing the Right Strategy
| Factor | Canary | Blue-Green | Percentage |
|--------|--------|------------|------------|
| Rollback speed | Seconds (reroute traffic) | Seconds (switch environments) | Seconds (disable flag) |
| Infrastructure cost | Low (runs alongside existing) | High (two full environments) | Low (same infrastructure) |
| Metric comparison | Strong (side-by-side comparison) | Weak (before/after only) | Strong (group comparison) |
| User consistency | No (each request may hit different version) | Yes (all users see same version) | Yes (each user sees consistent version) |
| Complexity | Moderate | Moderate | Low (if you have feature flags) |
| Best for | API changes, performance changes | Full environment validation | User-facing features |
Many teams use more than one strategy. A common pattern:
- Blue-green for infrastructure and platform changes
- Canary for service-level changes
- Percentage rollout for user-facing feature changes
Automated Rollback
Progressive rollout is only effective if rollback is automated. A human noticing a problem at 3 AM is not a reliable rollback mechanism.
Metrics to Monitor
Define automated rollback triggers before deploying. Common triggers:
| Metric | Trigger Condition | Example |
|--------|-------------------|---------|
| Error rate | Canary error rate > 2x stable error rate | Stable: 0.1%, Canary: 0.3% -> rollback |
| Latency (p99) | Canary p99 > 1.5x stable p99 | Stable: 200ms, Canary: 400ms -> rollback |
| Health check | Any health check failure | HTTP 500 on /health -> rollback |
| Business metric | Conversion rate drops > 5% for canary group | 10% conversion -> 4% conversion -> rollback |
| Saturation | CPU or memory exceeds threshold | CPU > 90% for 5 minutes -> rollback |
Automated Rollback Flow
```
Deploy new version
        │
        ▼
Route 5% of traffic to new version
        │
        ▼
Monitor for 15 minutes
        │
        ├── Metrics healthy ──────► Increase to 25%
        │                                │
        │                                ▼
        │                       Monitor for 30 minutes
        │                                │
        │                                ├── Metrics healthy ──────► Increase to 100%
        │                                │
        │                                └── Metrics degraded ─────► ROLLBACK
        │
        └── Metrics degraded ─────► ROLLBACK
```
Tools that can automate this analysis and rollback include:

| Tool | How It Helps |
|------|--------------|
| Argo Rollouts | Kubernetes-native progressive delivery with automated analysis and rollback |
| Flagger | Progressive delivery operator for Kubernetes with Istio, Linkerd, or App Mesh |
| Spinnaker | Multi-cloud deployment platform with canary analysis |
| Custom scripts | Query your metrics system, compare thresholds, trigger rollback via API |
The specific tool matters less than the principle: define rollback criteria before deploying, monitor automatically, and roll back without human intervention.
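As a sketch of the "custom scripts" row: fetch canary and stable metrics, apply the thresholds from the trigger table above, and decide. The metric-fetching and rollback calls are hypothetical placeholders for your own monitoring and deployment APIs.

```python
def should_rollback(stable_error_rate: float, canary_error_rate: float,
                    stable_p99_ms: float, canary_p99_ms: float) -> bool:
    """Apply the example trigger conditions from the table above."""
    if canary_error_rate > 2 * stable_error_rate:
        return True
    if canary_p99_ms > 1.5 * stable_p99_ms:
        return True
    return False

# Hypothetical wiring - replace with calls to your metrics and deployment systems:
# if should_rollback(fetch("error_rate", "stable"), fetch("error_rate", "canary"),
#                    fetch("latency_p99", "stable"), fetch("latency_p99", "canary")):
#     trigger_rollback()
```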
Implementing Progressive Rollout
Step 1: Choose Your First Strategy (Week 1)
Pick the strategy that matches your infrastructure:
- If you already have feature flags: start with percentage-based rollout
- If you have Kubernetes with a service mesh: start with canary
- If you have parallel environments: start with blue-green
Step 2: Define Rollback Criteria (Week 1)
Before your first progressive deployment:
- Identify the 3-5 metrics that define “healthy” for your service
- Define numerical thresholds for each metric
- Define the monitoring window (how long to wait before advancing)
- Document the rollback procedure (even if automated, document it for human understanding)
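One way to capture these decisions is as data that the later automation (Step 4) can read; a minimal sketch with illustrative metric names and thresholds:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RollbackCriterion:
    metric: str            # name of the metric in your monitoring system
    comparison: str        # "canary_vs_stable" ratio or "absolute" threshold
    threshold: float       # e.g. 2.0 means the canary may be at most 2x stable
    window_minutes: int    # how long to observe before advancing

# Illustrative criteria for a hypothetical service - tune the numbers to your SLOs.
ROLLBACK_CRITERIA = [
    RollbackCriterion("http_error_rate", "canary_vs_stable", 2.0, 15),
    RollbackCriterion("latency_p99_ms", "canary_vs_stable", 1.5, 15),
    RollbackCriterion("cpu_utilization_pct", "absolute", 90.0, 5),
]
```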
Step 3: Run a Manual Progressive Rollout (Week 2-3)
Before automating, run the process manually:
- Deploy to a canary or small percentage
- A team member monitors the dashboard for the defined window
- The team member decides to advance or roll back
- Document what they checked and how they decided
This manual practice builds understanding of what the automation will do.
Step 4: Automate the Rollout (Week 4+)
Replace the manual monitoring with automated checks:
- Implement metric queries that check your rollback criteria
- Implement automated traffic shifting (advance or rollback based on metrics)
- Implement alerting so the team knows when a rollback occurs
- Test the automation by intentionally deploying a known-bad change (in a controlled way)
Key Pitfalls
1. “Our canary doesn’t get enough traffic for meaningful metrics”
If your service handles 100 requests per hour, a 5% canary gets 5 requests per hour - not enough to detect problems statistically. Solutions: use a higher canary percentage (25-50%), use longer monitoring windows, or use blue-green instead (which does not require traffic splitting).
2. “We have progressive rollout but rollback is still manual”
Progressive rollout without automated rollback is half a solution. If the canary shows problems at 2 AM and nobody is watching, the damage occurs before anyone responds. Automated rollback is the essential companion to progressive rollout.
3. “We treat progressive rollout as a replacement for testing”
Progressive rollout is the last line of defense, not the first. If you are regularly catching bugs in canary that your test suite should have caught, your test suite needs improvement. Progressive rollout should catch rare, production-specific issues - not common bugs.
4. “Our rollout takes days because we’re too cautious”
A rollout that takes a week negates the benefits of continuous deployment. If your confidence in the pipeline is low enough to require a week-long rollout, the issue is pipeline quality, not rollout speed. Address the root cause through better testing and more production-like environments.
Measuring Success
| Metric | Target | Why It Matters |
|--------|--------|----------------|
| Automated rollbacks per month | Low and stable | Confirms the pipeline catches most issues before production |
| Time from deploy to full rollout | Hours, not days | Confirms the team has confidence in the process |
| Incidents caught by progressive rollout | Tracked (any number) | Confirms the progressive rollout is providing value |
| Manual interventions during rollout | Zero | Confirms the process is fully automated |
Next Step
With deploy on demand and progressive rollout, your technical deployment infrastructure is complete. Agentic CD explores how AI-assisted patterns can extend these practices further.
3.5.3 - Agentic CD
Extend continuous deployment with constraints and practices for AI agent-generated changes.
Phase 4 - Deliver on Demand | Adapted from MinimumCD.org
As AI coding agents become capable of generating production-ready code changes, the continuous deployment pipeline must evolve to handle agent-generated work with the same rigor applied to human-generated work - and in some cases, more rigor. Agentic CD defines the additional constraints and artifacts needed when agents contribute to the delivery pipeline.
What Is Agentic CD?
Agentic CD extends the Minimum CD framework to address a new category of contributor: AI agents that can generate, test, and propose code changes. These agents may operate autonomously (generating changes without human prompting) or collaboratively (assisting a human developer).
The core principle is simple: an agent-generated change must meet or exceed the same quality bar as a human-generated change. The pipeline does not care who wrote the code. It cares whether the code is correct, tested, and safe to deploy.
But agents introduce unique challenges that require additional constraints:
- Agents can generate changes faster than humans can review them
- Agents may lack context about organizational norms, business rules, or unstated constraints
- Agents cannot currently exercise judgment about risk in the same way humans can
- Agents may introduce subtle correctness issues that pass automated tests but violate intent
The Six First-Class Artifacts
Agentic CD defines six artifacts that must be explicitly maintained in a delivery pipeline that includes AI agents. These artifacts exist in human-driven CD as well, but they are often implicit. When agents are involved, they must be explicit.
1. Intent Description
What it is: A human-readable description of the desired change, written by a human.
Why it matters for agentic CD: The intent description is the agent’s “prompt” in the broadest sense. It defines what the change should accomplish, not how. Without a clear intent description, the agent may generate technically correct code that does not match what was needed.
Example:
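An illustrative intent description - hypothetical, not drawn from the source material - might read:

```
Allow customers to export their order history as a CSV file for a date
range they choose. The export must respect the same visibility rules as
the order history page and must never include another customer's data.
```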
Key property: The intent description is authored by a human. It is the human’s specification of what the agent should achieve. The agent does not write or modify the intent description.
2. User-Facing Behavior
What it is: A description of how the system should behave from the user’s perspective, expressed as observable outcomes.
Why it matters for agentic CD: Agents can generate code that satisfies tests but does not produce the expected user experience. User-facing behavior descriptions bridge the gap between technical correctness and user value.
Format: BDD scenarios work well here (see Small Batches):
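A hypothetical scenario for the export example above:

```
Scenario: Customer exports recent orders
  Given a signed-in customer with 12 orders in the last 30 days
  When they request a CSV export for the last 30 days
  Then the downloaded file contains exactly those 12 orders
  And no orders belonging to other customers appear in the file
```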
3. Feature Description
What it is: A technical description of the feature’s architecture, dependencies, and integration points.
Why it matters for agentic CD: Agents need explicit architectural context that human developers often carry in their heads. The feature description tells the agent where the change fits in the system, what components it touches, and what constraints apply.
Example:
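An illustrative feature description for the same hypothetical change:

```
Feature: Order history CSV export
Touches: orders-api (new /orders/export endpoint), web frontend (export button)
Depends on: existing OrderRepository; no new external services
Constraints: exports are generated synchronously for up to 1,000 orders;
             larger exports are out of scope for this change
```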
4. Executable Truth
What it is: Automated tests that define the correct behavior of the system. These tests are the authoritative source of truth for what the code should do.
Why it matters for agentic CD: For human developers, tests verify the code. For agent-generated code, tests also constrain the agent. If the tests are comprehensive, the agent cannot generate incorrect code that passes. If the tests are shallow, the agent can generate code that passes tests but does not satisfy the intent.
Key principle: Executable truth must be written or reviewed by a human before the agent generates the implementation. This inverts the common practice of writing tests after code. In agentic CD, the tests come first because they are the specification.
5. Implementation
What it is: The actual code that implements the feature. In agentic CD, this may be generated entirely by the agent, co-authored by agent and human, or authored by a human with agent assistance.
Why it matters for agentic CD: The implementation is the artifact most likely to be agent-generated. The key requirement is that it must satisfy the executable truth (tests), conform to the feature description (architecture), and achieve the intent description (purpose).
Review requirements: Agent-generated implementation must be reviewed by a human before merging to trunk. The review focuses on:
- Does the implementation match the intent? (Not just “does it pass tests?”)
- Does it follow the architectural constraints in the feature description?
- Does it introduce unnecessary complexity, dependencies, or security risks?
- Would a human developer on the team understand and maintain this code?
6. System Constraints
What it is: Non-functional requirements, security policies, performance budgets, and organizational rules that apply to all changes.
Why it matters for agentic CD: Human developers internalize system constraints through experience and team norms. Agents need these constraints stated explicitly.
Examples:
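Illustrative system constraints (hypothetical values, not organizational policy from the source):

- All endpoints require authentication; no new public routes without a security review
- p99 latency budget for user-facing requests: 300 ms
- No new runtime dependencies without a license and vulnerability check
- Personally identifiable information is never written to application logs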
The Agentic CD Workflow
When an AI agent contributes to a CD pipeline, the workflow extends the standard CD pipeline:
1. HUMAN writes Intent Description
2. HUMAN writes or reviews User-Facing Behavior (BDD scenarios)
3. HUMAN writes or reviews Feature Description (architecture)
4. HUMAN writes or reviews Executable Truth (tests)
5. AGENT generates Implementation (code)
6. PIPELINE validates Implementation against Executable Truth (automated tests)
7. HUMAN reviews Implementation (code review)
8. PIPELINE deploys (same pipeline as any other change)
Key differences from standard CD:
- Steps 1-4 happen before the agent generates code (test-first is mandatory, not optional)
- Step 7 (human review) is mandatory for agent-generated code
- System constraints are checked automatically in the pipeline (Step 6)
Constraints for Agent-Generated Changes
Beyond the six artifacts, agentic CD imposes additional constraints on agent-generated changes:
Change Size Limits
Agent-generated changes must be small. Large agent-generated changes are harder to review and more likely to contain subtle issues.
Guideline: An agent-generated change should modify no more files and no more lines than a human would in a single commit. If the change is larger, break it into multiple sequential changes.
Mandatory Human Review
Every agent-generated change must be reviewed by a human before merging to trunk. This is a non-negotiable constraint. The purpose is not to check the agent’s “work” in a supervisory sense - it is to verify that the change matches the intent and fits the system.
Comprehensive Test Coverage
Agent-generated code must have higher test coverage than the team’s baseline. If the team’s baseline is 80% coverage, agent-generated code should target 90%+. This compensates for the reduced human oversight of the implementation details.
Provenance Tracking
The pipeline must record which changes were agent-generated, which agent generated them, and what prompt or intent description was used. This supports audit, debugging, and learning.
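A minimal sketch of what a provenance record could look like, stored alongside the pipeline run or emitted as structured log output; the field names are illustrative, not from any specific tool.

```python
import json
from datetime import datetime, timezone

# Illustrative provenance record for one agent-generated change.
provenance = {
    "commit": "abc1234",                    # the change being tracked
    "generated_by": "agent",                # "human", "agent", or "pair"
    "agent_name": "example-coding-agent",   # hypothetical agent identifier
    "agent_version": "2026-01",
    "intent_ref": "TICKET-123",             # link to the intent description
    "human_reviewer": "j.doe",
    "recorded_at": datetime.now(timezone.utc).isoformat(),
}

print(json.dumps(provenance, indent=2))
```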
Getting Started with Agentic CD
Before jumping into agentic workflows, ensure your team has the prerequisite delivery practices
in place. The AI Adoption Roadmap provides a
step-by-step sequence: quality tools, clear requirements, hardened guardrails, and reduced
delivery friction - all before accelerating with AI coding.
Phase 1: Agent as Assistant
The agent helps human developers write code, but the human makes all decisions and commits all changes. The pipeline does not know or care about agent involvement.
This is where most teams are today. It requires no pipeline changes.
Phase 2: Agent as Contributor
The agent generates complete changes based on intent descriptions and executable truth. A human reviews and merges. The pipeline validates.
Requires: Explicit intent descriptions, test-first workflow, human review gate.
Phase 3: Agent as Autonomous Contributor
The agent generates, tests, and proposes changes with minimal human involvement. Human review is still mandatory, but the agent handles the full cycle from intent to implementation.
Requires: All six first-class artifacts, comprehensive system constraints, provenance tracking, and high confidence in the executable truth.
Key Pitfalls
1. “We let the agent generate tests and code together”
If the agent writes both the tests and the code, the tests may be designed to pass the code rather than to verify the intent. Tests must be written or reviewed by a human before the agent generates the implementation. This is the most important constraint in agentic CD.
2. “The agent generates changes faster than we can review them”
This is a feature, not a bug - but only if you have the discipline to not merge unreviewed changes. The agent’s speed should not pressure humans to review faster. WIP limits apply: if the review queue is full, the agent stops generating new changes.
3. “We trust the agent because it passed the tests”
Passing tests is necessary but not sufficient. Tests cannot verify intent, architectural fitness, or maintainability. Human review remains mandatory.
4. “We don’t track which changes are agent-generated”
Without provenance tracking, you cannot learn from agent-generated failures, audit agent behavior, or improve the agent’s constraints over time. Track provenance from the start.
Measuring Success
| Metric | Target | Why It Matters |
|--------|--------|----------------|
| Agent-generated change failure rate | Equal to or lower than human-generated | Confirms agent changes meet the same quality bar |
| Review time for agent-generated changes | Comparable to human-generated changes | Confirms changes are reviewable, not rubber-stamped |
| Test coverage for agent-generated code | Higher than baseline | Confirms the additional coverage constraint is met |
| Agent-generated changes with complete artifacts | 100% | Confirms the six-artifact workflow is being followed |
Next Step
For real-world examples of teams that have made the full journey to continuous deployment, see Experience Reports.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
3.5.4 - Experience Reports
Real-world stories from teams that have made the journey to continuous deployment.
Phase 4 - Deliver on Demand | Adapted from MinimumCD.org
Theory is necessary but insufficient. This page collects experience reports from organizations that have adopted continuous deployment at scale, including the challenges they faced, the approaches they took, and the results they achieved. These reports demonstrate that CD is not limited to startups or greenfield projects - it works in large, complex, regulated environments.
Why Experience Reports Matter
Every team considering continuous deployment faces the same objection: “That works for [Google / Netflix / small startups], but our situation is different.” Experience reports counter this objection with evidence. They show that organizations of every size, in every industry, with every kind of legacy system, have found a path to continuous deployment.
No experience report will match your situation exactly. That is not the point. The point is to extract patterns: what obstacles did these teams encounter, and how did they overcome them?
Walmart: CD at Retail Scale
Context
Walmart operates one of the world’s largest e-commerce platforms alongside its massive physical retail infrastructure. Changes to the platform affect millions of transactions per day. The organization had a traditional release process with weekly deployment windows and multi-stage manual approval.
The Challenge
- Scale: Thousands of developers across hundreds of teams
- Risk tolerance: Any outage affects revenue in real time
- Legacy: Decades of existing systems with deep interdependencies
- Regulation: PCI compliance requirements for payment processing
What They Did
- Invested in a centralized deployment platform (OneOps, later Concord) that standardized the deployment pipeline across all teams
- Broke the monolithic release into independent service deployments
- Implemented automated canary analysis for every deployment
- Moved from weekly release trains to on-demand deployment per team
Key Lessons
- Platform investment pays off. Building a shared deployment platform let hundreds of teams adopt CD without each team solving the same infrastructure problems.
- Compliance and CD are compatible. Automated pipelines with full audit trails satisfied PCI requirements more reliably than manual approval processes.
- Cultural change is harder than technical change. Teams that had operated on weekly release cycles for years needed coaching and support to trust automated deployment.
Microsoft: From Waterfall to Daily Deploys
Context
Microsoft’s Azure DevOps (formerly Visual Studio Team Services) team made a widely documented transformation from 3-year waterfall releases to deploying multiple times per day. This transformation happened within one of the largest software organizations in the world.
The Challenge
- History: Decades of waterfall development culture
- Product complexity: A platform used by millions of developers
- Organizational size: Thousands of engineers across multiple time zones
- Customer expectations: Enterprise customers expected stability and predictability
What They Did
- Broke the product into independently deployable services (ring-based deployment)
- Implemented a ring-based rollout: Ring 0 (team), Ring 1 (internal Microsoft users), Ring 2 (select external users), Ring 3 (all users)
- Invested heavily in automated testing, achieving thousands of tests running in minutes
- Moved from a fixed release cadence to continuous deployment with feature flags controlling release
- Used telemetry to detect issues in real time and automatically rolled back when metrics degraded
Key Lessons
- Ring-based deployment is progressive rollout. Microsoft’s ring model is an implementation of the progressive rollout strategies described in this guide.
- Feature flags enabled decoupling. By deploying frequently but releasing features incrementally via flags, the team could deploy without worrying about feature completeness.
- The transformation took years, not months. Moving from 3-year cycles to daily deployment was a multi-year journey with incremental progress at each step.
Google: Engineering Productivity at Scale
Context
Google is often cited as the canonical example of continuous deployment, deploying changes to production thousands of times per day across its vast service portfolio.
The Challenge
- Scale: Billions of users, millions of servers
- Monorepo: Most of Google operates from a single repository with billions of lines of code
- Interdependencies: Changes in shared libraries can affect thousands of services
- Velocity: Thousands of engineers committing changes every day
What They Did
- Built a culture of automated testing where tests are a first-class deliverable, not an afterthought
- Implemented a submit queue that runs automated tests on every change before it merges to the trunk
- Invested in build infrastructure (Blaze/Bazel) that can build and test only the affected portions of the codebase
- Used percentage-based rollout for user-facing changes
- Made rollback a one-click operation available to every team
Key Lessons
- Test infrastructure is critical infrastructure. Google’s ability to deploy frequently depends entirely on its ability to test quickly and reliably.
- Monorepo and CD are compatible. The common assumption that CD requires microservices with separate repos is false. Google deploys from a monorepo.
- Invest in tooling before process. Google built the tooling (build systems, test infrastructure, deployment automation) that made good practices the path of least resistance.
Amazon: Two-Pizza Teams and Ownership
Context
Amazon’s transformation to service-oriented architecture and team ownership is one of the most influential in the industry. The “two-pizza team” model and “you build it, you run it” philosophy directly enabled continuous deployment.
The Challenge
- Organizational size: Hundreds of thousands of employees
- System complexity: Thousands of services powering amazon.com and AWS
- Availability requirements: Even brief outages are front-page news
- Pace of innovation: Competitive pressure demands rapid feature delivery
What They Did
- Decomposed the system into independently deployable services, each owned by a small team
- Gave teams full ownership: build, test, deploy, operate, and support
- Built internal deployment tooling (Apollo) that automates canary analysis, rollback, and one-click deployment
- Established the practice of deploying every commit that passes the pipeline, with automated rollback on metric degradation
Key Lessons
- Ownership drives quality. When the team that writes the code also operates it in production, they write better code and build better monitoring.
- Small teams move faster. Two-pizza teams (6-10 people) can make decisions without bureaucratic overhead.
- Automation eliminates toil. Amazon’s internal deployment tooling means that deploying requires no specialist skill - any team member can deploy (and the pipeline usually deploys automatically).
HP: CD in Hardware-Adjacent Software
Context
HP’s LaserJet firmware team demonstrated that continuous delivery principles apply even to embedded software, a domain often considered incompatible with frequent deployment.
The Challenge
- Embedded software: Firmware that runs on physical printers
- Long development cycles: Firmware releases had traditionally been annual
- Quality requirements: Firmware bugs require physical recalls or complex update procedures
- Team size: Large, distributed teams with varying skill levels
What They Did
- Invested in automated testing infrastructure for firmware
- Reduced build times from days to under an hour
- Moved from annual releases to frequent incremental updates
- Implemented continuous integration with automated test suites running on simulator and hardware
Key Lessons
- CD principles are universal. Even embedded firmware can benefit from small batches, automated testing, and continuous integration.
- Build time is a critical constraint. Reducing build time from days to under an hour unlocked the ability to test frequently, which enabled frequent integration, which enabled frequent delivery.
- Results were dramatic: Development costs reduced by approximately 40%, programs delivered on schedule increased by roughly 140%.
Flickr: “10+ Deploys Per Day”
Context
Flickr’s 2009 presentation “10+ Deploys Per Day: Dev and Ops Cooperation” is credited with helping launch the DevOps movement. At a time when most organizations deployed quarterly, Flickr was deploying more than ten times per day.
The Challenge
- Web-scale service: Serving billions of photos to millions of users
- Ops/Dev divide: Traditional separation between development and operations teams
- Fear of change: Deployments were infrequent because they were risky
What They Did
- Built automated infrastructure provisioning and deployment
- Implemented feature flags to decouple deployment from release
- Created a culture of shared responsibility between development and operations
- Made deployment a routine, low-ceremony event that anyone could trigger
- Used IRC bots (and later chat-based tools) to coordinate and log deployments
Key Lessons
- Culture is the enabler. Flickr’s technical practices were important, but the cultural shift - developers and operations working together, shared responsibility, mutual respect - was what made frequent deployment possible.
- Tooling should reduce friction. Flickr’s deployment tools were designed to make deploying as easy as possible. The easier it is to deploy, the more often people deploy, and the smaller each deployment becomes.
- Transparency builds trust. Logging every deployment in a shared channel let everyone see what was deploying, who deployed it, and whether it caused problems. This transparency built organizational trust in frequent deployment.
Common Patterns Across Reports
Despite the diversity of these organizations, several patterns emerge consistently:
1. Investment in Automation Precedes Cultural Change
Every organization built the tooling first. Automated testing, automated deployment, automated rollback - these created the conditions where frequent deployment was possible. Cultural change followed when people saw that the automation worked.
2. Incremental Adoption, Not Big Bang
No organization switched to continuous deployment overnight. They all moved incrementally: shorter release cycles first, then weekly deploys, then daily, then on-demand. Each step built confidence for the next.
3. Team Ownership Is Essential
Organizations that gave teams ownership of their deployments (build it, run it) moved faster than those that kept deployment as a centralized function. Ownership creates accountability, which drives quality.
4. Feature Flags Are Universal
Every organization in these reports uses feature flags to decouple deployment from release. This is not optional for continuous deployment - it is foundational.
5. The Results Are Consistent
Regardless of industry, size, or starting point, organizations that adopt continuous deployment consistently report:
- Higher deployment frequency (daily or more)
- Lower change failure rate (small changes fail less)
- Faster recovery (automated rollback, small blast radius)
- Higher developer satisfaction (less toil, more impact)
- Better business outcomes (faster time to market, reduced costs)
Applying These Lessons to Your Migration
You do not need to be Google-sized to benefit from these patterns. Extract what applies:
- Start with automation. Build the pipeline, the tests, the rollback mechanism.
- Adopt incrementally. Move from monthly to weekly to daily. Do not try to jump to 10 deploys per day on day one.
- Give teams ownership. Let teams deploy their own services.
- Use feature flags. Decouple deployment from release.
- Measure and improve. Track DORA metrics. Run experiments. Use retrospectives.
These are the practices covered throughout this migration guide. The experience reports confirm that they work - not in theory, but in production, at scale, in the real world.
Further Reading
For detailed experience reports and additional case studies, see:
- MinimumCD.org Experience Reports - Collected reports from organizations practicing minimum CD
- Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim - The research behind DORA metrics, with extensive case study data
- Continuous Delivery by Jez Humble and David Farley - The foundational text, with detailed examples from multiple organizations
- The DevOps Handbook by Gene Kim, Jez Humble, Patrick Debois, and John Willis - Case studies from organizations across industries
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
4 - CD for Greenfield Projects
Starting a new project? Build continuous delivery in from day one instead of retrofitting it later.
Starting with CD is dramatically easier than migrating to it. When there is no legacy process,
no existing test suite to fix, and no entrenched habits to change, you can build the right
practices from the first commit. This section shows you how.
Why Start with CD
Teams that build CD into a new project from the beginning avoid the most painful parts of the
migration journey. There is no test suite to rewrite, no branching strategy to unwind, no
deployment process to automate after the fact. Every practice described in this guide can be
adopted on day one when there is no existing codebase to constrain you.
The cost of adopting CD practices in a greenfield project is near zero. The cost of retrofitting
them into a mature codebase can be months of work. The earlier you start, the less it costs.
What to Build from Day One
Pipeline first
Before writing application code, set up your delivery pipeline. The pipeline is feature zero.
Your first commit should include:
- A build script that compiles, tests, and packages the application
- A CI configuration that runs on every push to trunk
- A deployment mechanism (even if the first “deployment” is to a local environment)
- Every validation you know you will need from the start
The validations you put in the pipeline on day one define the quality standard for the
application. They are not overhead you add later - they are the mold that shapes every line of
code that follows. If you add linting after 10,000 lines of code, you are fixing 10,000 lines of
code. If you add it before the first line, every line is written to the standard.
Feature zero validations:
- Code style and formatting - Enforce a formatter (Prettier, Black, gofmt) so style is
never a code review conversation. The pipeline rejects code that is not formatted.
- Linting - Static analysis rules for your language (ESLint, pylint, golangci-lint). Catches
bugs, enforces idioms, and prevents anti-patterns before review.
- Type checking - If your language supports static type checking (TypeScript, Python with mypy, Java), enable strict mode from the start. Relaxing later is easy. Tightening later is painful.
- Test framework - The test runner is configured and a first test exists, even if it only
asserts that the application starts. The team should never have to set up testing
infrastructure - it is already there.
- Security scanning - Dependency vulnerability scanning (Dependabot, Snyk, Trivy) and basic
SAST rules. Security findings block the build from day one, so the team never accumulates a
backlog of vulnerabilities.
- Commit message or PR conventions - If you enforce conventional commits, changelog
generation, or PR title formats, add the check now.
Every one of these is trivial to add to an empty project and expensive to retrofit into a mature
codebase. The pipeline enforces them automatically, so the team never has to argue about them in
review. The conversation shifts from “should we fix this?” to “the pipeline already enforces
this.”
The pipeline should exist before the first feature. Every feature you build will flow through it
and meet every standard you defined on day one.
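A minimal sketch of a feature-zero check script that the pipeline could run on every push. The tools shown (ruff, mypy, pytest, pip-audit) are one possible Python stack - an assumption for illustration, not a prescription; substitute the equivalents for your language.

```python
#!/usr/bin/env python3
"""Run every feature-zero validation and fail the build on the first problem."""
import subprocess
import sys

CHECKS = [
    ["ruff", "format", "--check", "."],   # formatting is never a review conversation
    ["ruff", "check", "."],               # linting
    ["mypy", "--strict", "src"],          # type checking, strict from day one
    ["pytest", "-q"],                     # the first test exists before the first feature
    ["pip-audit"],                        # dependency vulnerability scanning
]

for cmd in CHECKS:
    print(f"==> {' '.join(cmd)}")
    if subprocess.run(cmd).returncode != 0:
        sys.exit(1)
```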
Deploy “hello world” to production
Your first deployment should happen before your first feature. Deploy the simplest possible
application - a health check endpoint, a static page, a “hello world” - all the way to
production through your pipeline. This is the single most important validation you can do early
because it proves the entire path works: build, test, package, deploy, verify.
Why production, not staging: The goal is to prove the full path works end-to-end. If you
deploy only to a staging environment, you have proven that the pipeline works up to staging. You
have not proven that production credentials, network routes, DNS, load balancers, permissions,
and deployment targets are correctly configured. Every gap between your test environment and
production is an assumption that will be tested for the first time under pressure, when it
matters most.
Deploy “hello world” to production on day one, and you will discover:
- Whether the team has the access and permissions to deploy
- Whether the infrastructure provisioning actually works
- Whether the deployment mechanism handles a real production environment
- Whether monitoring and health checks are wired up correctly
- Whether rollback works before you need it in an emergency
All of these are problems you want to find with a “hello world,” not with a real feature under
a deadline.
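A minimal post-deploy smoke check, assuming the "hello world" exposes a /health endpoint and the pipeline supplies the production URL; both are assumptions for illustration.

```python
#!/usr/bin/env python3
"""Fail the pipeline if the freshly deployed 'hello world' is not healthy."""
import sys
import urllib.request

# Placeholder URL - the pipeline would inject the real production address.
HEALTH_URL = "https://example.com/health"

try:
    with urllib.request.urlopen(HEALTH_URL, timeout=5) as response:
        if response.status != 200:
            sys.exit(f"Health check returned HTTP {response.status}")
except OSError as exc:
    sys.exit(f"Health check failed: {exc}")

print("Deployment verified: /health returned HTTP 200")
```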
Warning: deploying only to lower environments
If organizational constraints prevent you from deploying to production immediately, deploy as
close to production as you can. But be explicit about what this means: every environment that
is not production is an approximation. Lower environments may differ in network topology,
security policies, resource capacity, data volume, and third-party integrations. Each difference
is a gap in your confidence.
Track these gaps. Document every known difference between your deployment target and production.
Treat closing each gap as a priority, because until you have deployed to production through your
pipeline, you have not fully validated the path. The longer you wait, the more assumptions
accumulate, and the riskier the first real production deployment becomes.
Trunk-based development from the start
There is no reason to start with long-lived branches. From commit one:
- All work happens on trunk (or short-lived branches that merge to trunk within a day)
- The pipeline runs on every integration to trunk
- Trunk is always in a deployable state
See Trunk-Based Development for the practices.
Test architecture from the start
Design your test architecture before you have tests to migrate. Establish:
- Unit tests for all business logic
- Integration tests for every external boundary (databases, APIs, message queues)
- Functional tests that exercise your service in isolation with test doubles for dependencies
- Contract tests for every external dependency
- A clear rule: everything that blocks deployment is deterministic
See Testing Fundamentals for the full test architecture.
Small, vertical slices from the start
Decompose the first features into small, independently deployable increments. Establish the habit
of delivering thin vertical slices before the team has a chance to develop a batch mindset.
See Work Decomposition for slicing techniques.
Greenfield Checklist
Use this checklist to verify your new project is set up for CD from the start.
Week 1
Month 1
Month 3
Common Mistakes in Greenfield Projects
| Mistake | Why it happens | What to do instead |
|---------|----------------|--------------------|
| “We’ll add tests later” | Pressure to show progress on features | Write the first test before the first feature. TDD from day one. |
| “We’ll set up the pipeline later” | Pipeline feels like overhead when there’s little code | The pipeline is the first thing you build. Features flow through it. |
| Starting with feature branches | Habit from previous projects | Trunk-based development from commit one. No reason to start with branches. |
| Designing for scale before you have users | Over-engineering from the start | Build the simplest thing that works. Deploy frequently. Evolve the architecture based on real feedback. |
| Skipping contract tests because “we own both services” | Feels redundant when one team owns everything | You will not own everything forever. Contract tests are cheap to add early and expensive to add later. |
Related Content
5 - Defect Sources
A catalog of defect causes across the delivery value stream with detection methods, AI enhancement opportunities, and systemic fixes.
Adapted from AI Patterns: Defect Detection
Defects do not appear randomly. They originate from specific, predictable sources in the delivery
value stream. This reference catalogs those sources so teams can shift detection left, automate
where possible, and apply AI to accelerate the feedback loop.
Product & Discovery
These defects originate before a single line of code is written. They are the most expensive to
fix because they compound through every downstream phase.
| Defect Cause | Detection Method | AI Enhancement | Fix |
|--------------|------------------|----------------|-----|
| Building the wrong thing | Adoption dashboards, user research validation | Synthesize user feedback, support tickets, and usage data to surface misalignment earlier than production metrics | Validated user research before backlog entry; dual-track agile |
| Solving a problem nobody has | Problem validation stage gate, user interviews | Analyze support tickets and feature requests to identify real vs. assumed pain points | Problem validation as a stage gate; publish problem brief before solution |
| Correct problem, wrong solution | Prototype testing, A/B experiments | Compare proposed solution against prior approaches in similar domains | Prototype multiple approaches; measurable success criteria first |
| Meets spec but misses user intent | User acceptance testing, session recordings | Review acceptance criteria against user behavior data to flag misalignment | Acceptance criteria focused on user outcomes, not checklists |
| Over-engineering beyond need | Code complexity metrics, architecture review | Flag unnecessary abstraction layers and unused extension points | YAGNI principle; justify every abstraction layer |
| Prioritizing wrong work | Outcome tracking, opportunity scoring | Automated WSJF scoring using historical outcome data | WSJF prioritization with outcome data |
Integration & Boundaries
Defects at system boundaries are invisible to unit tests and often survive until production.
Contract testing and deliberate boundary design are the primary defenses.
| Defect Cause | Detection Method | AI Enhancement | Fix |
|--------------|------------------|----------------|-----|
| Interface mismatches | Contract tests (Pact, OpenAPI, buf) | Compare API schemas across versions to detect breaking changes before merge | Mandatory contract tests per boundary; API-first with generated clients |
| Wrong assumptions about upstream/downstream | Integration tests, behavioral contract documentation | Analyze call patterns across services to detect undocumented behavioral expectations | Document behavioral contracts; defensive coding at boundaries |
| Race conditions | Thread sanitizers, concurrency testing | Static analysis for concurrent access patterns; suggest idempotent alternatives | Idempotent design; queues over shared mutable state |
| Inconsistent distributed state | Distributed tracing (Jaeger, Zipkin), chaos engineering | Anomaly detection across distributed state to flag synchronization failures | Deliberate consistency model choices; saga with compensation logic |
Knowledge & Communication
These defects emerge from gaps between what people know and what the code expresses.
They are the hardest to detect with automated tools and the easiest to prevent with team practices.
| Defect Cause | Detection Method | AI Enhancement | Fix |
|--------------|------------------|----------------|-----|
| Implicit domain knowledge not in code | Knowledge concentration metrics, code review | Generate documentation from code and tests; flag where docs have drifted from implementation | Domain-Driven Design with ubiquitous language; embed rules in code |
| Ambiguous requirements | Three Amigos sessions, example mapping | Review requirements for ambiguity, missing edge cases, and contradictions; generate test scenarios | Three Amigos before work; example mapping; executable specs |
| Tribal knowledge loss | Bus factor analysis, documentation coverage | Identify knowledge silos by analyzing commit patterns and code ownership concentration | Pair/mob programming as default; rotate on-call; living docs |
| Divergent mental models across teams | Cross-team reviews, shared domain models | Compare terminology and domain models across codebases to detect semantic mismatches | Shared domain models; explicit bounded contexts |
Change & Complexity
These defects are caused by the act of changing existing code. The larger the change and the
longer it lives outside trunk, the higher the risk.
| Defect Cause | Detection Method | AI Enhancement | Fix |
|--------------|------------------|----------------|-----|
| Unintended side effects | Mutation testing (Stryker, PIT), regression suites | Automated blast radius analysis from change diffs; flag high-risk modifications | Small focused commits; trunk-based development; feature flags |
| Accumulated technical debt | Code complexity trends (CodeScene), static analysis | Track complexity trends and predict which modules are approaching failure thresholds | Refactoring as part of every story; dedicated debt budget |
| Unanticipated feature interactions | Feature flag testing, canary deployments | Analyze feature flag combinations to predict interaction conflicts | Feature flags with controlled rollout; modular design; canary deployments |
| Configuration drift | Infrastructure as code validation, environment diffing | Detect configuration differences across environments automatically | Infrastructure as code; immutable infrastructure; GitOps |
Testing & Observability Gaps
These defects survive because the safety net has holes. The fix is not more testing - it is
better-targeted testing and observability that closes the specific gaps.
| Defect Cause | Detection Method | AI Enhancement | Fix |
|---|---|---|---|
| Untested edge cases and error paths | Property-based testing (Hypothesis, fast-check), boundary analysis | Generate edge case test scenarios from code analysis; identify untested paths | Property-based testing as standard; boundary value analysis |
| Missing contract tests at boundaries | Boundary inventory audit, integration failure analysis | Scan service boundaries and flag missing contract test coverage | Mandatory contract tests per new boundary |
| Insufficient monitoring | SLO tracking, incident post-mortems | Analyze production incidents to recommend missing monitoring and alerting | Observability as non-functional requirement; SLOs for every user-facing path |
| Test environments don’t reflect production | Environment parity checks, deployment failure analysis | Compare environment configurations to flag meaningful differences | Production-like data in staging; test in production with flags |
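Property-based testing appears in this table as both a detection method and a fix. Here is a minimal sketch using fast-check (one of the tools listed above) against a hypothetical clamp() function: instead of asserting on a handful of hand-picked inputs, the test states a property that must hold for every generated input, which is how untested edge cases get found.
```typescript
// Property-based test sketch using fast-check. clamp() is a hypothetical
// example function, not something defined elsewhere in this guide.
import { test } from "node:test";
import fc from "fast-check";

function clamp(value: number, min: number, max: number): number {
  return Math.min(Math.max(value, min), max);
}

test("clamp always returns a value inside the range", () => {
  fc.assert(
    fc.property(fc.integer(), fc.integer(), fc.integer(), (value, a, b) => {
      const [min, max] = a <= b ? [a, b] : [b, a];
      const result = clamp(value, min, max);
      // The property holds for every generated input, covering edge cases
      // (min === max, negative ranges) a hand-written example table would miss.
      return result >= min && result <= max;
    }),
  );
});
```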
Process & Deployment
These defects are caused by the delivery process itself. Manual steps, large batches, and
slow feedback loops create the conditions for failure.
| Defect Cause | Detection Method | AI Enhancement | Fix |
|---|---|---|---|
| Long-lived branches | Branch age metrics, merge conflict frequency | Flag branches exceeding age thresholds; predict merge conflict probability | Trunk-based development; merge at least daily |
| Manual pipeline steps | Value stream mapping, deployment audit | Identify manual steps in the pipeline that can be automated | Automate every step commit-to-production |
| Batching too many changes per release | Deployment frequency metrics, change failure correlation | Correlate batch size with failure rates to quantify the cost of large batches | Continuous delivery; every commit is a candidate |
| Inadequate rollback capability | Rollback testing, incident recovery time | Automated risk scoring from change diff and deployment history | Blue/green or canary deployments; auto-rollback on health failure |
| Reliance on human review to catch preventable defects | Defect origin analysis, review effectiveness metrics | Identify defects caught in review that could be caught by automated tools | Reserve human review for knowledge transfer and design decisions |
| Manual review of risks and compliance (CAB) | Change lead time analysis, CAB effectiveness metrics | Automated change risk scoring to replace subjective risk assessment | Replace CAB with automated progressive delivery |
Data & State
Data defects are particularly dangerous because they can corrupt persistent state. Unlike code
defects, data corruption often cannot be fixed by deploying a new version.
| Defect Cause | Detection Method | AI Enhancement | Fix |
|---|---|---|---|
| Schema migration and backward compatibility failures | Migration testing, schema version tracking | Analyze schema changes for backward compatibility violations before merge | Expand-then-contract schema migrations; never breaking changes |
| Null or missing data assumptions | Null safety analysis (NullAway, TypeScript strict), property testing | Static analysis for null safety; flag unhandled optional values | Null-safe type systems; Option/Maybe as default; validate at boundaries |
| Concurrency and ordering issues | Distributed tracing, idempotency testing | Detect patterns vulnerable to out-of-order delivery | Design for out-of-order delivery; idempotent consumers |
| Cache invalidation errors | Cache hit/miss monitoring, stale data detection | Analyze cache invalidation patterns and flag potential staleness windows | Short TTLs; event-driven invalidation |
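The "idempotent consumers" fix for concurrency and ordering issues is easier to see in code than in prose. Below is a minimal, in-memory sketch of a consumer that tolerates duplicate and out-of-order delivery; the event shape is an assumption, and a real consumer would persist the de-duplication and version state in a durable store alongside the business data.
```typescript
// Idempotent, out-of-order-tolerant consumer (illustrative sketch).
// The message shape and the in-memory stores are assumptions.
interface PaymentEvent {
  eventId: string;   // unique per event, used for de-duplication
  accountId: string;
  version: number;   // monotonically increasing per account
  balance: number;
}

const processedEvents = new Set<string>();
const latestVersion = new Map<string, number>();
const balances = new Map<string, number>();

function handle(event: PaymentEvent): void {
  // Duplicate delivery: applying the same event twice must not change state.
  if (processedEvents.has(event.eventId)) return;

  // Out-of-order delivery: ignore events older than what is already applied.
  const seen = latestVersion.get(event.accountId) ?? 0;
  if (event.version <= seen) {
    processedEvents.add(event.eventId);
    return;
  }

  balances.set(event.accountId, event.balance);
  latestVersion.set(event.accountId, event.version);
  processedEvents.add(event.eventId);
}
```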
Dependency & Infrastructure
These defects originate outside your codebase but break your system. The fix is to treat
external dependencies as untrusted boundaries.
| Defect Cause | Detection Method | AI Enhancement | Fix |
|---|---|---|---|
| Third-party library breaking changes | Dependency scanning (Dependabot, Renovate), automated upgrade PRs | Analyze changelog and API diffs to predict breaking impact before upgrade | Pin dependencies; automated upgrade PRs with test gates |
| Infrastructure differences across environments | Infrastructure as code validation, environment parity checks | Compare infrastructure definitions across environments to flag drift | Single source of truth for all environments; containerization |
| Network partitions and partial failures handled wrong | Chaos engineering (Gremlin, Litmus), failure injection testing | Analyze error handling code for missing failure modes | Circuit breakers; retries; bulkheads as defaults; test failure modes explicitly |
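The table lists circuit breakers as a default fix for partial failures. Here is a minimal sketch of the pattern, with hypothetical thresholds and a made-up inventory endpoint; production systems would typically reach for an existing implementation (for example, the opossum library in Node.js) rather than hand-rolling one.
```typescript
// Minimal circuit breaker around an external call (illustrative sketch).
type State = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: State = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5,   // assumed threshold
    private readonly resetAfterMs = 30_000,  // assumed cool-down
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.resetAfterMs) {
        // Fail fast instead of piling requests onto a struggling dependency.
        throw new Error("circuit open: dependency unavailable");
      }
      this.state = "half-open"; // allow one trial request through
    }
    try {
      const result = await fn();
      this.state = "closed";
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}

// Usage: wrap every call across the untrusted boundary.
const inventoryBreaker = new CircuitBreaker();
const getStock = (sku: string) =>
  inventoryBreaker.call(() => fetch(`http://inventory.internal/stock/${sku}`));
```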
From Reactive to Proactive
Systemic Thinking
The traditional approach to defects is reactive: wait for a bug, find it, fix it. The catalog
above enables a proactive approach: understand where defects originate, detect them at the
earliest possible point, and fix the systemic cause rather than the individual symptom.
AI enhances this shift by processing signals (code changes, test results, production metrics,
user feedback) faster and across more dimensions than manual analysis allows. But AI does not
replace the systemic fixes. Automated detection without process change just finds defects faster
without preventing them.
The goal is not zero defects. The goal is defects caught at the cheapest point in the value
stream, with systemic fixes that prevent the same category of defect from recurring.
Related Content
This content is adapted from AI Patterns: Defect Detection,
licensed under CC BY 4.0.
6 - AI Adoption Roadmap
A prescriptive guide for incorporating AI into your delivery process safely - remove friction and add safety before accelerating with AI coding.
Adapted from Incorporating AI Without Crashing
AI adoption is chaos testing for your organization. It does not create new problems - it reveals
existing ones. Teams that try to accelerate with AI before fixing their delivery process get the
same result as putting a bigger engine in a car with no way to steer or stop: you go faster,
briefly, and then something expensive happens. This page provides the prescriptive sequence for
incorporating AI safely, mirroring the brownfield migration phases.
The Key Insight
AI amplifies whatever system it is applied to. If your delivery process has strong guardrails,
fast feedback, and clear requirements, AI makes you faster. If your process has unclear
requirements, manual gates, fragile tests, and slow pipelines, AI makes those problems worse -
and it makes them worse faster.
The sequence matters: remove friction and add safety before you accelerate.
The Progression
graph LR
A["1. Quality Tools"] --> B["2. Clarify Work"]
B --> C["3. Harden Guardrails"]
C --> D["4. Remove Friction"]
D --> E["5. Accelerate with AI"]
style A fill:#e8f4fd,stroke:#1a73e8
style B fill:#e8f4fd,stroke:#1a73e8
style C fill:#fce8e6,stroke:#d93025
style D fill:#fce8e6,stroke:#d93025
style E fill:#e6f4ea,stroke:#137333
Each step builds on the previous one. Skipping steps means AI amplifies your weaknesses
instead of your strengths.
Step 1: Choose Quality Tools
Brownfield phase: Assess
Before using AI for anything, choose models and tools that minimize hallucination and rework.
Not all AI tools are equal. A model that generates plausible-looking but incorrect code creates
more work than it saves.
What to do:
- Evaluate AI coding tools on accuracy, not speed. A tool that generates correct code 80% of
the time and incorrect code 20% of the time has a hidden rework tax on every use.
- Use models with strong reasoning capabilities for code generation. Smaller, faster models are
appropriate for autocomplete and suggestions, not for generating business logic.
- Establish a baseline: measure how much rework AI-generated code requires before and after
changing tools. If rework exceeds 20% of generated output, the tool is a net negative.
What this enables: A foundation of AI tooling that generates correct output more often than
not, so subsequent steps build on working code rather than compensating for broken code.
Step 2: Clarify Work Before Coding
Brownfield phase: Assess / Foundations
Use AI to improve requirements before code is written, not to write code from vague requirements.
Ambiguous requirements are the single largest source of defects
(see Defect Sources), and AI can detect ambiguity faster than
manual review.
What to do:
- Use AI to review tickets, user stories, and acceptance criteria before development begins.
Prompt it to identify gaps, contradictions, untestable statements, and missing edge cases.
- Use AI to generate test scenarios from requirements. If the AI cannot generate clear test
cases, the requirements are not clear enough for a human either.
- Use AI to analyze support tickets and incident reports for patterns that should inform
the backlog.
What this enables: Higher-quality inputs to the development process. Developers (human or AI)
start with clear, testable specifications rather than ambiguous descriptions that produce
ambiguous code.
Step 3: Harden Guardrails
Brownfield phase: Foundations / Pipeline
Before accelerating code generation, strengthen the safety net that catches mistakes. This means
both product guardrails (does the code work?) and development guardrails (is the code
maintainable?).
Product and operational guardrails:
- Automated test suites with meaningful coverage of critical paths
- Deterministic CI/CD pipelines that run on every commit
- Deployment validation (smoke tests, health checks, canary analysis)
Development guardrails:
- Code style enforcement (linters, formatters) that runs automatically
- Architecture rules (dependency constraints, module boundaries) enforced in the pipeline (a minimal sketch appears at the end of this step)
- Security scanning (SAST, dependency vulnerability checks) on every commit
What to do:
- Audit your current guardrails. For each one, ask: “If AI generated code that violated this,
would our pipeline catch it?” If the answer is no, fix the guardrail before expanding AI use.
- Add contract tests at service boundaries. AI-generated code is
particularly prone to breaking implicit contracts between services.
- Ensure test suites run in minutes, not hours. Slow tests create pressure to skip them, which
is dangerous when code is generated faster.
What this enables: A safety net that catches mistakes regardless of who (or what) made them.
The pipeline becomes the authority on code quality, not human reviewers.
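As an illustration of a development guardrail the pipeline can enforce (the architecture-rules item above), here is a minimal sketch of a layering check, assuming a hypothetical rule that code under src/domain/ must never import from src/infrastructure/. Purpose-built tools such as dependency-cruiser or ESLint import rules do this more robustly; the point is that the rule runs in the pipeline and fails the build regardless of who wrote the code.
```typescript
// Architecture rule as an automated pipeline check (illustrative sketch).
// The src/domain and src/infrastructure layout is an assumption.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

function sourceFiles(dir: string): string[] {
  return readdirSync(dir, { withFileTypes: true }).flatMap((entry) => {
    const path = join(dir, entry.name);
    if (entry.isDirectory()) return sourceFiles(path);
    return path.endsWith(".ts") ? [path] : [];
  });
}

// Flag any domain file that imports from the infrastructure layer.
const violations = sourceFiles("src/domain").filter((file) =>
  /from\s+["'].*infrastructure/.test(readFileSync(file, "utf8")),
);

if (violations.length > 0) {
  console.error("Domain code must not depend on infrastructure:", violations);
  process.exit(1); // fail the pipeline, regardless of who wrote the code
}
```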
Step 4: Reduce Delivery Friction
Brownfield phase: Pipeline / Optimize
Remove the manual steps, slow processes, and fragile environments that limit how fast you can
safely deliver. These bottlenecks exist in every brownfield system, and they become acute when AI accelerates the code generation phase.
What to do:
- Remove manual approval gates that add wait time without adding safety
(see Replacing Manual Validations).
- Fix fragile test and staging environments that cause intermittent failures.
- Shorten branch lifetimes. If branches live longer than a day, integration pain will increase
as AI accelerates code generation.
- Automate deployment. If deploying requires a runbook or a specific person, it is a bottleneck
that will be exposed when code moves faster.
What this enables: A delivery pipeline where the time from “code complete” to “running in
production” is measured in minutes, not days. When this path is fast and reliable, AI-generated
code flows through the same pipeline as human-generated code with the same safety guarantees.
Step 5: Accelerate with AI Coding
Brownfield phase: Optimize / Continuous Deployment
Now - and only now - expand AI use to code generation, refactoring, and autonomous contributions.
The guardrails are in place. The pipeline is fast. Requirements are clear. The outcome of every
change is deterministic regardless of whether a human or an AI wrote it.
What to do:
- Use AI for code generation with the test-first workflow described in
Agentic CD. Write tests first, then let AI generate the implementation (a minimal sketch follows at the end of this step).
- Use AI for refactoring: extracting interfaces, reducing complexity, improving test coverage.
These are high-value, low-risk tasks where AI excels.
- Use AI to analyze incidents and suggest fixes, with the same pipeline validation applied to
any change.
What this enables: AI-accelerated development where the speed increase translates to faster
delivery, not faster defect generation. The pipeline enforces the same quality bar regardless of
the author.
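A minimal sketch of the test-first loop described above, using a hypothetical applyDiscount() function: the test is written first from the acceptance criteria, and whatever implementation follows (human- or AI-generated) only merges if the test and the rest of the pipeline pass.
```typescript
// Test-first workflow sketch. applyDiscount() is a hypothetical function
// used only for illustration.
import { test } from "node:test";
import assert from "node:assert/strict";

// 1. Written first, by a human, from the acceptance criteria.
test("orders over 100 get a 10% discount, others are unchanged", () => {
  assert.equal(applyDiscount(150), 135);
  assert.equal(applyDiscount(100), 100);
  assert.equal(applyDiscount(0), 0);
});

// 2. Generated second (by a human or an AI assistant); it merges only if
//    the test above and the rest of the pipeline pass.
function applyDiscount(total: number): number {
  return total > 100 ? total * 0.9 : total;
}
```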
Mapping to Brownfield Phases
| AI Adoption Step | Brownfield Phase | Key Connection |
|---|---|---|
| 1. Quality Tools | Assess | Evaluate tooling as part of current-state assessment |
| 2. Clarify Work | Assess / Foundations | Better requirements feed better work decomposition |
| 3. Harden Guardrails | Foundations / Pipeline | Same testing and pipeline work, with AI-readiness as additional motivation |
| 4. Remove Friction | Pipeline / Optimize | Same automation and flow optimization, unblocking AI-speed delivery |
| 5. Accelerate with AI | Optimize / CD | AI coding becomes safe when the pipeline is deterministic and fast |
The Destination: Agentic CD
The end state of this roadmap is a delivery pipeline where AI agents can contribute code with the
same safety guarantees as human developers. This is
Agentic CD - the extension of
continuous deployment to handle agent-generated changes. You do not need to be at CD maturity to
start the AI adoption roadmap, but the roadmap leads there.
Related Content
This content is adapted from
Incorporating AI Without Crashing
by Bryan Finster.
7 - FAQ
Frequently asked questions about continuous delivery and this migration guide.
Adapted from MinimumCD.org
About This Guide
Why does this migration guide exist?
Many teams say they want to adopt continuous delivery but do not know where to start. The CD
landscape is full of tools, frameworks, and advice, but there is no clear, sequenced path from
“we deploy monthly” to “we can deploy any change at any time.” This guide provides that path.
It is built on the MinimumCD definition of continuous delivery and
draws on practices from the Dojo Consortium and the
DORA research. The content is organized as a migration – a phased journey
from your current state to continuous delivery – rather than as a description of what CD looks
like when you are already there.
Who is this guide for?
This guide is for development teams, tech leads, and engineering managers who want to improve
their software delivery practices. It is designed for teams that are currently deploying
infrequently (monthly, quarterly, or less) and want to reach a state where any change can be
deployed to production at any time.
You do not need to be starting from zero. If your team already has CI in place, you can begin
with Phase 2 – Pipeline. If you have a pipeline but deploy infrequently, start
with Phase 3 – Optimize. Use the Phase 0 assessment to find your
starting point.
Should we adopt this guide as an organization or as a team?
Start with a single team. CD adoption works best when a team can experiment, learn, and iterate
without waiting for organizational consensus. Once one team demonstrates results – shorter lead
times, lower change failure rate, more frequent deployments – other teams will have a concrete
example to follow.
Organizational adoption comes after team adoption, not before. The role of organizational
leadership is to create the conditions for teams to succeed: stable team composition, tool
funding, policy flexibility for deployment processes, and protection from pressure to cut
corners on quality.
How do we use this guide for improvement?
Start with Phase 0 – Assess. Map your value stream, measure your current
performance, and identify your top constraints. Then work through the phases in order, focusing
on one constraint at a time.
The guide is not a checklist to complete in sequence. It is a reference that helps you decide
what to work on next. Some teams will spend months in Phase 1 building testing fundamentals.
Others will move quickly to Phase 2 because they already have strong development practices.
Your value stream map and metrics tell you where to invest.
Revisit your assessment periodically. As you improve, new constraints will emerge. The phases
give you a framework for addressing them.
Continuous Delivery Concepts
What is the difference between continuous delivery and continuous deployment?
Continuous delivery means every change to the codebase is always in a deployable state and
can be released to production at any time through a fully automated pipeline. The decision to
deploy may still be made by a human, but the capability to deploy is always present.
Continuous deployment is an extension of continuous delivery where every change that passes
the automated pipeline is deployed to production without manual intervention.
This migration guide takes you through continuous delivery (Phases 0-3) and then to continuous
deployment (Phase 4). Continuous delivery is the prerequisite. You cannot safely automate
deployment decisions until your pipeline reliably determines what is deployable.
Is continuous delivery the same as having a CI/CD pipeline?
No. Many teams have a CI/CD pipeline tool (Jenkins, GitHub Actions, GitLab CI, etc.) but are
not practicing continuous delivery. A pipeline tool is necessary but not sufficient.
Continuous delivery requires:
- Trunk-based development – all developers integrating to trunk at least daily
- Comprehensive test automation – fast, reliable tests that catch real defects
- A single path to production – every change goes through the same automated pipeline
- Immutable artifacts – build once, deploy the same artifact everywhere
- The ability to deploy any green build – not just special “release” builds
If your team has a pipeline but uses long-lived feature branches, deploys only at the end of a
sprint, or requires manual testing before a release, you have a pipeline tool but you are not
practicing continuous delivery. The current-state checklist
in Phase 0 helps you assess the gap.
What does “the pipeline is the only path to production” mean?
It means there is exactly one way for any change to reach production: through the automated
pipeline. No one can SSH into a server and make a change. No one can skip the test suite for
an “urgent” fix. No one can deploy from their local machine.
This constraint is what gives you confidence. If every change in production has been through
the same build, test, and deployment process, you know what is running and how it got there.
If exceptions are allowed, you lose that guarantee, and your ability to reason about production
state degrades.
During your migration, establishing this single path is a key milestone in
Phase 2.
What does “application configuration” mean in the context of CD?
Application configuration refers to values that change between environments but are not part of
the application code: database connection strings, API endpoints, feature flag states, logging
levels, and similar settings.
In a CD pipeline, configuration is externalized – it lives outside the artifact and is injected
at deployment time. This is what makes immutable artifacts
possible. You build the artifact once and deploy it to any environment by providing the
appropriate configuration.
If configuration is embedded in the artifact (for example, hardcoded URLs or environment-specific
config files baked into a container image), you must rebuild the artifact for each environment,
which means the artifact you tested is not the artifact you deploy. This breaks the immutability
guarantee. See Application Config.
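A minimal sketch of what externalized configuration looks like in code, assuming environment variables as the injection mechanism and made-up variable names: the artifact contains none of these values, and the same build reads different values in each environment.
```typescript
// Externalized configuration sketch. Values are injected at deployment time;
// the variable names are examples, not a prescribed convention.
function required(name: string): string {
  const value = process.env[name];
  if (!value) {
    throw new Error(`Missing required configuration: ${name}`);
  }
  return value;
}

export const config = {
  databaseUrl: required("DATABASE_URL"),
  paymentsApiBase: required("PAYMENTS_API_BASE"),
  logLevel: process.env.LOG_LEVEL ?? "info",
};

// The same immutable artifact runs in every environment; only these values differ.
```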
What is an “immutable artifact” and why does it matter?
An immutable artifact is a build output (container image, binary, package) that is never
modified after it is created. The exact artifact that passes your test suite is the exact
artifact that is deployed to staging, and then to production. Nothing is recompiled, repackaged,
or patched between environments.
This matters because it eliminates an entire category of deployment failures: “it worked in
staging but not in production” caused by differences in the build. If the same bytes are
deployed everywhere, build-related discrepancies are impossible.
Immutability requires externalizing configuration (see above) and storing artifacts in a
registry or repository. See Immutable Artifacts.
What does “deployable” mean?
A change is deployable when it has passed all automated quality gates defined in the pipeline.
The definition is codified in the pipeline itself, not decided by a person at deployment time.
A typical deployable definition includes:
- All unit tests pass
- All integration tests pass
- All functional tests pass
- Static analysis checks pass (linting, security scanning)
- The artifact is built and stored in the artifact registry
- Deployment to a production-like environment succeeds
- Smoke tests in the production-like environment pass
If any of these gates fail, the change is not deployable. The pipeline makes this determination
automatically and consistently. See Deployable Definition.
What is the difference between deployment and release?
Deployment is the act of putting code into a production environment.
Release is the act of making functionality available to users.
These are different events, and decoupling them is one of the most powerful techniques in CD.
You can deploy code to production without releasing it to users by using
feature flags. The code is running in production, but the new
functionality is disabled. When you are ready, you enable the flag and the feature is released.
This decoupling is important because it separates the technical risk (will the deployment
succeed?) from the business risk (will users like the feature?). You can manage each risk
independently. Deployments become routine technical events. Releases become deliberate business
decisions.
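A minimal sketch of that decoupling, with a hypothetical flag name and an in-memory flag store standing in for a real flag service: the new checkout code is deployed and running in production, but users only see it once the flag is flipped.
```typescript
// Feature flag sketch: the code is deployed, the feature is released only when
// the flag is enabled. The flag name and in-memory store are assumptions; real
// systems typically use a flag service with per-user or percentage targeting.
const flags = new Map<string, boolean>([["new-checkout", false]]);

function isEnabled(flag: string): boolean {
  return flags.get(flag) ?? false;
}

function newCheckoutFlow(items: string[]): string {
  return `new checkout: ${items.length} items`; // deployed, dark until released
}

function legacyCheckoutFlow(items: string[]): string {
  return `legacy checkout: ${items.length} items`; // current behaviour for everyone
}

export function checkout(items: string[]): string {
  return isEnabled("new-checkout")
    ? newCheckoutFlow(items)
    : legacyCheckoutFlow(items);
}
```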
Migration Questions
How long does the migration take?
It depends on where you start and how much organizational support you have. As a rough guide:
- Phase 0 (Assess): 1-2 weeks
- Phase 1 (Foundations): 1-6 months, depending on current testing and TBD maturity
- Phase 2 (Pipeline): 1-3 months
- Phase 3 (Optimize): 2-6 months
- Phase 4 (Deliver on Demand): 1-3 months
These ranges assume a single team working on the migration alongside regular delivery work.
The biggest variable is Phase 1: teams with no test automation or TBD practice will spend
longer building foundations than teams that already have these in place.
Do not treat these timelines as commitments. The migration is an iterative improvement process,
not a project with a deadline.
Do we stop delivering features during the migration?
No. The migration is done alongside regular delivery work, not instead of it. Each migration
practice is adopted incrementally: you do not stop the world to rewrite your test suite or
redesign your pipeline.
For example, in Phase 1 you adopt trunk-based development by reducing branch lifetimes
gradually – from two weeks to one week to two days to same-day. You add automated tests
incrementally, starting with the highest-risk code paths. You decompose work into smaller
stories one sprint at a time.
The migration practices themselves improve your delivery speed, so the investment pays off
as you go. Teams that have completed Phase 1 typically report delivering features faster than
before, not slower.
What if our organization requires manual change approval (CAB)?
Many organizations have Change Advisory Board (CAB) processes that require manual approval
before production deployments. This is one of the most common organizational blockers for CD.
The path forward is to replace the manual approval with automated evidence. A CAB exists
because the organization lacks confidence that changes are safe. Your CD pipeline, when mature,
provides stronger evidence of safety than a committee meeting:
- Every change has passed comprehensive automated tests
- The exact artifact that was tested is the one being deployed
- Rollback is automated and takes minutes
- Deployment is a routine event that happens many times per week
Use your DORA metrics to demonstrate that automated pipelines produce lower change failure
rates than manual approval processes. Most CAB processes were designed for a world of monthly
releases with hundreds of changes per batch. When you deploy daily with one or two changes per
deployment, the risk profile is fundamentally different.
This is a gradual conversation, not a one-time negotiation. Start by inviting CAB
representatives to observe your pipeline. Show them the test results, the deployment logs,
the rollback capability. Build trust through evidence.
What if we have a monolithic architecture?
You can practice continuous delivery with a monolith. CD does not require microservices. Many
of the highest-performing teams in the DORA research deploy monolithic applications multiple
times per day.
What matters is that your architecture supports independent testing and deployment. A
well-structured monolith with a comprehensive test suite and a reliable pipeline can achieve
CD. A poorly structured collection of microservices with shared databases and coordinated
releases cannot.
Architecture decoupling is addressed in Phase 3, but
it is about enabling independent deployment and reducing coordination costs, not about adopting
any particular architectural style.
What if our tests are slow or unreliable?
This is one of the most common starting conditions. A slow or flaky test suite undermines
every CD practice: developers stop trusting the tests, broken builds are ignored, and the
pipeline becomes a bottleneck rather than an enabler.
The solution is incremental, not wholesale:
- Delete or quarantine flaky tests. A test that sometimes passes and sometimes fails
provides no signal. Remove it from the pipeline and fix it or replace it.
- Parallelize what you can. Many test suites are slow because they run sequentially.
Parallelization is often the fastest way to reduce pipeline duration.
- Rebalance the test pyramid. If most of your automated tests are end-to-end or UI
tests, they will be slow and brittle. Invest in unit and integration tests that run in
milliseconds and reserve end-to-end tests for critical paths only.
- Set a time budget. Your full pipeline – build, test, deploy to a staging environment
– should complete in under 10 minutes. If it takes longer, that is a constraint to address.
See Testing Fundamentals and the
Testing reference section for detailed guidance.
Where do I start if I am not sure which phase applies to us?
Start with Phase 0 – Assess. Complete the
value stream mapping exercise, take
baseline metrics, and fill out the
current-state checklist. These activities will tell you
exactly where you stand and which phase to begin with.
If you do not have time for a full assessment, ask yourself these questions:
- Do all developers integrate to trunk at least daily? If no, start with Phase 1.
- Do you have a single automated pipeline that every change goes through? If no, start with Phase 2.
- Can you deploy any green build to production on demand? If no, focus on the gap between your current state and Phase 2 completion criteria.
- Do you deploy at least weekly? If no, look at Phase 3 for batch size and flow optimization.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
8 - Reference
Supporting material: glossary, metrics definitions, testing guides, and additional resources.
This section provides reference material that supports your migration journey.
Use it alongside the phase guides for detailed definitions, metrics, and patterns.
Contents
- Glossary - Key terms and definitions
- CD Dependency Tree - How CD practices depend on each other
- Common Blockers - Frequently encountered obstacles and how to address them
- Defect Sources - Defect causes across the delivery value stream with detection methods and AI enhancements
- DORA Capabilities - The capabilities that drive software delivery performance
- Resources - Books, videos, and further reading
- Metrics - Detailed definitions for key delivery metrics
- Testing - Testing types, patterns, and best practices
8.1 - Glossary
Key terms and definitions used throughout this guide.
This glossary defines the terms used across every phase of the CD migration guide. Where a term
has a specific meaning within a migration phase, the relevant phase is noted.
A
Artifact
A packaged, versioned output of a build process (e.g., a container image, JAR file, or binary).
In a CD pipeline, artifacts are built once and promoted through environments without
modification. See Immutable Artifacts.
B
Baseline Metrics
The set of delivery measurements taken before beginning a migration, used as the benchmark
against which improvement is tracked. See Phase 0 – Baseline Metrics.
Batch Size
The amount of change included in a single deployment. Smaller batches reduce risk, simplify
debugging, and shorten feedback loops. Reducing batch size is a core focus of
Phase 3 – Small Batches.
BDD (Behavior-Driven Development)
A collaboration practice where developers, testers, and product representatives define expected
behavior using structured examples before code is written. BDD produces executable
specifications that serve as both documentation and automated tests. BDD supports effective
work decomposition by forcing clarity about what a
story actually means before development begins.
Blue-Green Deployment
A deployment strategy that maintains two identical production environments. New code is deployed
to the inactive environment, verified, and then traffic is switched. See
Progressive Rollout.
Branch Lifetime
The elapsed time between creating a branch and merging it to trunk. CD requires branch lifetimes
measured in hours, not days or weeks. Long branch lifetimes are a symptom of poor work
decomposition or slow code review. See Trunk-Based Development.
C
Canary Deployment
A deployment strategy where a new version is rolled out to a small subset of users or servers
before full rollout. If the canary shows no issues, the deployment proceeds to 100%. See
Progressive Rollout.
CD (Continuous Delivery)
The practice of ensuring that every change to the codebase is always in a deployable state and
can be released to production at any time through a fully automated pipeline. Continuous
delivery does not require that every change is deployed automatically, but it requires that
every change could be deployed automatically. This is the primary goal of this migration
guide.
Change Failure Rate (CFR)
The percentage of deployments to production that result in a degraded service and require
remediation (e.g., rollback, hotfix, or patch). One of the four DORA metrics. See
Metrics – Change Fail Rate.
CI (Continuous Integration)
The practice of integrating code changes to a shared trunk at least once per day, where each
integration is verified by an automated build and test suite. CI is a prerequisite for CD, not
a synonym. A team that runs automated builds on feature branches but merges weekly is not doing
CI. See Build Automation.
Constraint
In the Theory of Constraints, the single factor most limiting the throughput of a system.
During a CD migration, your job is to find and fix constraints in order of impact. See
Identify Constraints.
Continuous Deployment
An extension of continuous delivery where every change that passes the automated pipeline is
deployed to production without manual intervention. Continuous delivery ensures every change
can be deployed; continuous deployment ensures every change is deployed. See
Phase 4 – Deliver on Demand.
D
Deployable
A change that has passed all automated quality gates defined by the team and is ready for
production deployment. The definition of deployable is codified in the pipeline, not decided
by a person at deployment time. See Deployable Definition.
Deployment Frequency
How often an organization successfully deploys to production. One of the four DORA metrics.
See Metrics – Release Frequency.
Development Cycle Time
The elapsed time from the first commit on a change to that change being deployable. This
measures the efficiency of your development and pipeline process, excluding upstream wait times.
See Metrics – Development Cycle Time.
DORA Metrics
The four key metrics identified by the DORA (DevOps Research and Assessment) research program
as predictive of software delivery performance: deployment frequency, lead time for changes,
change failure rate, and mean time to restore service. See DORA Capabilities.
F
Feature Flag
A mechanism that allows code to be deployed to production with new functionality disabled,
then selectively enabled for specific users, percentages of traffic, or environments. Feature
flags decouple deployment from release. See Feature Flags.
Flow Efficiency
The ratio of active work time to total elapsed time in a delivery process. A flow efficiency of
15% means that for every hour of actual work, roughly 5.7 hours are spent waiting. Value stream
mapping reveals your flow efficiency. See Value Stream Mapping.
H
Hard Dependency
A dependency that must be resolved before work can proceed. In delivery, hard dependencies
include things like waiting for another team’s API, a shared database migration, or an
infrastructure provisioning request. Hard dependencies create queues and increase lead time.
Eliminating hard dependencies is a focus of
Architecture Decoupling.
Hardening Sprint
A sprint dedicated to stabilizing and fixing defects before a release. The existence of
hardening sprints is a strong signal that quality is not being built in during regular
development. Teams practicing CD do not need hardening sprints because every commit is
deployable. See Common Blockers.
I
Immutable Artifact
A build artifact that is never modified after creation. The same artifact that is tested in the
pipeline is the exact artifact that is deployed to production. Configuration differences between
environments are handled externally. See Immutable Artifacts.
Integration Frequency
How often a developer integrates code to the shared trunk. CD requires at least daily
integration. See Metrics – Integration Frequency.
L
Lead Time for Changes
The elapsed time from when a commit is made to when it is successfully running in production.
One of the four DORA metrics. See Metrics – Lead Time.
M
Mean Time to Restore (MTTR)
The elapsed time from when a production incident is detected to when service is restored. One
of the four DORA metrics. Teams practicing CD have short MTTR because deployments are small,
rollback is automated, and the cause of failure is easy to identify. See
Metrics – Mean Time to Repair.
P
Pipeline
The automated sequence of build, test, and deployment stages that every change passes through
on its way to production. See Phase 2 – Pipeline.
Production-Like Environment
A test or staging environment that matches production in configuration, infrastructure, and
data characteristics. Testing in environments that differ from production is a common source
of deployment failures. See Production-Like Environments.
R
Rollback
The ability to revert a production deployment to a previous known-good state. CD requires
automated rollback that takes minutes, not hours. See Rollback.
S
Soft Dependency
A dependency that can be worked around or deferred. Unlike hard dependencies, soft dependencies
do not block work but may influence sequencing or design decisions. Feature flags can turn many
hard dependencies into soft dependencies by allowing incomplete integrations to be deployed in
a disabled state.
Story Points
A relative estimation unit used by some teams to forecast effort. Story points are frequently
misused as a productivity metric, which creates perverse incentives to inflate estimates and
discourages the small work decomposition that CD requires. If your organization uses story
points as a velocity target, see Common Blockers.
T
TBD (Trunk-Based Development)
A source-control branching model where all developers integrate to a single shared branch
(trunk) at least once per day. Short-lived feature branches (less than a day) are acceptable.
Long-lived feature branches are not. TBD is a prerequisite for CI, which is in turn a
prerequisite for CD. See Trunk-Based Development.
TDD (Test-Driven Development)
A development practice where tests are written before the production code that makes them
pass. TDD supports CD by ensuring high test coverage, driving simple design, and producing
a fast, reliable test suite. TDD feeds into the testing fundamentals
required in Phase 1.
Toil
Repetitive, manual work related to maintaining a production service that is automatable, has
no lasting value, and scales linearly with service size. Examples include manual deployments,
manual environment provisioning, and manual test execution. Eliminating toil is a primary
benefit of building a CD pipeline.
U
Unplanned Work
Work that arrives outside the planned backlog – production incidents, urgent bug fixes,
ad hoc requests. High levels of unplanned work indicate systemic quality or operational
problems. Teams with high change failure rates generate their own unplanned work through
failed deployments. Reducing unplanned work is a natural outcome of improving change failure
rate through CD practices.
V
Value Stream Map
A visual representation of every step required to deliver a change from request to production,
showing process time, wait time, and percent complete and accurate at each step. The
foundational tool for Phase 0 – Assess.
Vertical Sliced Story
A user story that delivers a thin slice of functionality across all layers of the system
(UI, API, database, etc.) rather than a horizontal slice that implements one layer completely.
Vertical slices are independently deployable and testable, which is essential for CD. Vertical
slicing is a core technique in Work Decomposition.
W
WIP (Work in Progress)
The number of work items that have been started but not yet completed. High WIP increases lead
time, reduces focus, and increases context-switching overhead. Limiting WIP is a key practice
in Phase 3 – Limiting WIP.
Working Agreement
An explicit, documented set of team norms covering how work is defined, reviewed, tested, and
deployed. Working agreements create shared expectations and reduce friction. See
Working Agreements.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.2 - CD Dependency Tree
Visual guide showing how CD practices depend on and build upon each other.
Continuous delivery is not a single practice you adopt. It is a system of interdependent
practices where each one supports and enables others. This dependency tree shows those
relationships. Understanding the dependencies helps you plan your migration in the right
order – addressing foundational practices before building on them.
The Dependency Tree
The diagram below shows how the core practices of CD relate to each other. Read it from
bottom to top: lower practices enable higher ones. The migration phases in this guide are
sequenced to follow these dependencies.
graph BT
subgraph "Goal"
CD["Continuous Delivery"]
end
subgraph "Continuous Integration"
CI["Continuous Integration"]
end
subgraph "Development Practices"
TBD["Trunk-Based Development"]
TDD["Test-Driven Development"]
BDD["Behavior-Driven Development"]
WD["Work Decomposition"]
CR["Code Review"]
end
subgraph "Build & Test Infrastructure"
BA["Build Automation"]
TS["Test Suite"]
PLEnv["Production-Like Environments"]
end
subgraph "Pipeline Practices"
SPP["Single Path to Production"]
DP["Deterministic Pipeline"]
IA["Immutable Artifacts"]
AC["Application Config"]
RB["Rollback"]
DD["Deployable Definition"]
end
subgraph "Flow Optimization"
SB["Small Batches"]
FF["Feature Flags"]
WIP["WIP Limits"]
MDI["Metrics-Driven Improvement"]
end
subgraph "Organizational Practices"
WA["Working Agreements"]
Retro["Retrospectives"]
AD["Architecture Decoupling"]
end
%% Development Practices feed CI
TDD --> CI
BDD --> TDD
BDD --> WD
TBD --> CI
WD --> SB
CR --> TBD
%% Build infrastructure feeds CI
BA --> CI
TS --> CI
TDD --> TS
%% CI feeds pipeline
CI --> SPP
CI --> DP
PLEnv --> DP
%% Pipeline practices feed CD
SPP --> CD
DP --> CD
IA --> CD
AC --> IA
RB --> CD
DD --> CD
%% Flow optimization feeds CD
SB --> CD
FF --> SB
FF --> CD
WIP --> SB
MDI --> CD
%% Organizational practices support everything
WA --> CR
WA --> DD
Retro --> MDI
AD --> FF
AD --> SB
How to Read the Dependency Tree
Each arrow means “supports” or “enables.” When practice A has an arrow pointing to practice B,
it means A is a prerequisite or enabler for B.
Key dependency chains to understand:
BDD enables TDD enables CI enables CD
Behavior-Driven Development produces clear, testable acceptance criteria. Those criteria drive
Test-Driven Development at the code level. A comprehensive, fast test suite enables
Continuous Integration with confidence. And CI is the foundational prerequisite for CD.
If your team skips BDD, stories are ambiguous. If stories are ambiguous, tests are incomplete or wrong. Incomplete or wrong tests make CI unreliable, and unreliable CI makes CD impossible.
Work Decomposition enables Small Batches enables CD
You cannot deploy small batches if your work items are large. Work decomposition – breaking
features into vertical slices that can each be completed in
two days or less – is what makes small batches possible. Small batches in turn reduce
deployment risk and enable the rapid feedback that CD depends on.
Trunk-Based Development enables CI
CI requires that all developers integrate to a shared trunk at least once per day. If your team
uses long-lived feature branches, you are not doing CI regardless of how often your build server
runs. TBD is not optional for CD – it is a prerequisite.
Architecture Decoupling enables Feature Flags and Small Batches
Tightly coupled architectures force coordinated deployments. When changing service A requires
simultaneously changing services B and C, small independent deployments become impossible.
Architecture decoupling – through well-defined APIs, contract testing, and service boundaries
– enables teams to deploy independently, use feature flags effectively, and maintain small
batch sizes.
Mapping to Migration Phases
The dependency tree directly informs the sequencing of migration phases:
| Dependency Layer | Migration Phase | Why This Order |
|---|---|---|
| Development practices (TBD, TDD, BDD, work decomposition, code review) | Phase 1 – Foundations | These are prerequisites for CI, which is a prerequisite for everything else |
| Build and test infrastructure (build automation, test suite, production-like environments) | Phase 1 and Phase 2 | You need a reliable build and test infrastructure before you can build a reliable pipeline |
| Pipeline practices (single path, deterministic pipeline, immutable artifacts, config, rollback) | Phase 2 – Pipeline | The pipeline depends on solid CI and development practices |
| Flow optimization (small batches, feature flags, WIP limits, metrics) | Phase 3 – Optimize | Optimization requires a working pipeline to optimize |
| Organizational practices (working agreements, retrospectives, architecture decoupling) | All phases | These cross-cutting practices support every phase and should be established early |
Using the Tree to Diagnose Problems
When something in your delivery process is not working, trace it through the dependency tree
to find the root cause.
Example 1: Deployments keep failing.
Look at what feeds CD in the tree. Is your pipeline deterministic? Are you using immutable
artifacts? Is your application config externalized? The failure is likely in one of the
pipeline practices.
Example 2: CI builds are constantly broken.
Look at what feeds CI. Are developers actually practicing TBD (integrating daily)? Is the test
suite reliable, or is it full of flaky tests? Is the build automated end-to-end? The broken
builds are a symptom of a problem in the development practices layer.
Example 3: You cannot reduce batch size.
Look at what feeds small batches. Is work being decomposed into vertical slices? Are feature
flags available so partial work can be deployed safely? Is the architecture decoupled enough
to allow independent deployment? The batch size problem originates in one of these upstream
practices.
Migration Tip
When you encounter a problem, resist the urge to fix the symptom. Use the dependency tree to
trace the problem to its root cause. Fixing the symptom (for example, adding more manual
testing to catch deployment failures) will not solve the underlying issue and often adds
toil that makes things worse. Fix the dependency that is broken, and the downstream problem
resolves itself.
Practices Not Shown
The tree above focuses on the core technical and process practices. Several important
supporting practices are not shown for clarity but are covered elsewhere in this guide:
- Observability and monitoring – essential for progressive rollout and fast incident response
- Security automation – integrated into the pipeline as automated checks rather than manual gates
- Database change management – a common constraint addressed during pipeline architecture
- Team topology and organizational design – addressed through working agreements and architectural decoupling
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.3 - Common Blockers
Frequently encountered obstacles on the path to CD and how to address them.
Every team migrating to continuous delivery will encounter obstacles. Some are technical. Most
are not. The blockers listed here are drawn from patterns observed across hundreds of teams
attempting the journey to CD. Recognizing them early helps you address root causes rather than
fight symptoms.
Work Breakdown Problems
Stories Too Large
What it looks like: User stories regularly take more than a week to complete. Developers
work on a single story for days without integrating. Sprint commitments are frequently missed
because “the story was bigger than we thought.”
Why it blocks CD: Large stories mean large batches. Large batches mean infrequent
integration. Infrequent integration means painful merges, delayed feedback, and high-risk
deployments. You cannot practice continuous integration – the prerequisite for CD – if your
work items take a week.
What to do: Adopt vertical slicing. Every story should deliver a thin slice of user-visible
functionality across all layers of the system. Target a maximum of two days from start to
done. See Work Decomposition.
No Vertical Slicing
What it looks like: Stories are organized by technical layer (“build the API,” “build the
database schema,” “build the UI”) rather than by user-visible behavior. Multiple stories must
be completed before anything is demonstrable or testable end-to-end.
Why it blocks CD: Horizontal slices cannot be independently deployed or tested. They create
hard dependencies between stories and teams. Nothing is deployable until all layers are
assembled, which forces large-batch releases.
What to do: Rewrite stories as vertical slices that deliver end-to-end functionality,
even if the initial slice is minimal. A single form field that saves to the database and
displays a confirmation is a vertical slice. An entire database schema with no UI is not.
Team Workflow Problems
Too Much Work in Progress
What it looks like: Every developer is working on a different story. The team has 8 items
in progress and 0 items done. Standup meetings are long because everyone has a different
context to report on. Nothing is finished, but everything is started.
Why it blocks CD: High WIP destroys flow. When everything is in progress, nothing gets the
focused attention needed to finish. Context switching between items adds overhead. The
delivery pipeline sees sporadic, large commits rather than a steady stream of small ones.
What to do: Set explicit WIP limits. A team of 6 developers should have no more than 3-4
items in progress at any time. The goal is to finish work, not to start it. See
Limiting WIP.
Distant Date Commitments
What it looks like: The team has committed to delivering a specific scope by a date months
in the future. The commitment was made before the work was understood. Progress is tracked
against the original plan, and “falling behind” triggers pressure to cut corners.
Why it blocks CD: Fixed-scope, fixed-date commitments incentivize large batches. Teams
hoard changes until the deadline, then deploy everything at once. There is no incentive to
deliver incrementally because the commitment is about the whole scope, not about continuous
flow. When the deadline pressure mounts, testing is the first thing cut.
What to do: Shift to continuous delivery of small increments. Report progress by showing
working software in production, not by comparing actuals to a Gantt chart. If date commitments
are required by the organization, negotiate on scope rather than on quality.
Velocity Used as a Productivity Metric
What it looks like: Management tracks story points completed per sprint as a measure of
team productivity. Teams are compared by velocity. There is pressure to increase velocity
every sprint.
Why it blocks CD: When velocity is a target, it ceases to be a useful measure (Goodhart’s
Law). Teams inflate estimates to look productive. Stories get larger because larger stories
have more points. The incentive is to maximize points, not to deliver small, frequent, valuable
changes to production.
What to do: Replace velocity with DORA metrics – deployment
frequency, lead time, change failure rate, and mean time to restore. These measure delivery
outcomes rather than output volume.
Manual Testing Gates
Hardening Sprints
What it looks like: The team allocates one or more sprints after “feature complete” to
stabilize, fix bugs, and prepare for release. Code is frozen during hardening. Testers run
manual regression suites. Bug counts are tracked on a burndown chart.
Why it blocks CD: A hardening sprint is an admission that the normal development process
does not produce deployable software. If you need a dedicated period to make code
production-ready, you are not continuously delivering – you are doing waterfall with shorter
phases. Hardening sprints add weeks of delay and encourage teams to accumulate technical debt
during feature sprints because “we’ll fix it in hardening.”
What to do: Eliminate the need for hardening by building quality in. Adopt TDD to ensure
test coverage. Use a CI pipeline that runs the full test suite on every commit. Define
“deployable” as an automated pipeline outcome, not as a manual assessment. See
Testing Fundamentals and
Deployable Definition.
Manual Regression Testing
What it looks like: Every release requires a manual regression test cycle that takes days
or weeks. Testers execute scripted test cases against the application. New features are tested
manually before they are considered done.
Why it blocks CD: Manual regression testing scales linearly with application size and
inversely with delivery frequency. The more features you add, the longer regression takes.
The longer regression takes, the less frequently you can deploy. This is the opposite of CD.
What to do: Automate regression tests. Not all at once – start with the highest-risk
areas and the tests that block deployments most frequently. Your automated test suite should
give you the same confidence as manual regression, but in minutes rather than days. See
Testing Fundamentals.
Organizational Anti-Patterns
Meaningless Retrospectives
What it looks like: Retrospectives happen on schedule, but action items are never
completed. The same problems surface every sprint. The team has stopped believing that
retrospectives lead to change.
Why it blocks CD: CD requires continuous improvement. If the mechanism for identifying and
addressing process problems is broken, systemic issues accumulate. The same blockers will
persist indefinitely.
What to do: Limit retrospective action items to one or two per sprint and track them as
work items with the same visibility as feature work. Make the action items specific and
completable. “Improve testing” is not an action item. “Automate the login flow regression
test” is. See Retrospectives.
Team Instability
What it looks like: Team members are frequently reassigned to other projects. New people
join and leave every few sprints. The team never builds shared context or working agreements.
Why it blocks CD: CD practices depend on team discipline and shared understanding. TBD
requires trust between developers. Code review speed depends on familiarity with the codebase.
Working agreements require a stable group to establish and maintain. Constantly reshuffling
teams means constantly restarting the journey.
What to do: Advocate for stable, long-lived teams. The team should own a product or service
for its full lifecycle, not be assembled for a project and disbanded when it ends.
One Delivery per Sprint
What it looks like: The team delivers to production once per sprint, typically at the end.
All stories from the sprint are bundled into a single release. The “sprint demo” is the first
time stakeholders see working software.
Why it blocks CD: One delivery per sprint is not continuous delivery. It is a two-week batch
release with Agile terminology. If something breaks in the batch, any of the changes could be
the cause. Rollback means losing the entire sprint’s work. Feedback is delayed by weeks.
What to do: Start deploying individual stories as they are completed, not at the end of
the sprint. This requires a working CI pipeline, trunk-based development, and the ability to
deploy independently. These are the outcomes of Phase 1 and
Phase 2.
Anti-Patterns Summary
The table below maps each common blocker to its root cause and the migration phase that
addresses it.
Where to Start
If you recognize many of these blockers in your team, do not try to address them all at once.
Use the CD Dependency Tree to understand which practices are
prerequisite to others, and use your value stream map
to identify which blocker is the current constraint. Fix the biggest constraint first, then
move to the next.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.4 - DORA Capabilities
The capabilities that drive software delivery performance, as identified by DORA research.
The DevOps Research and Assessment (DORA) research program has identified capabilities that
predict high software delivery performance. These capabilities are not tools or technologies
– they are practices and cultural conditions that enable teams to deliver software quickly,
reliably, and sustainably.
This page organizes the DORA capabilities by their relevance to each migration phase. Use it
as a reference to understand which capabilities you are building at each stage of your journey
and which ones to focus on next.
Continuous Delivery Capabilities
These capabilities directly support the mechanics of getting software from commit to production.
They are the primary focus of Phases 1 and 2 of the migration.
Version Control
All production artifacts – application code, test code, infrastructure configuration,
deployment scripts, and database schemas – are stored in version control and can be
reproduced from a single source of truth.
Migration relevance: This is a prerequisite for Phase 1. If any part of your delivery
process depends on files stored on a specific person’s machine or a shared drive, address that
before beginning the migration.
Continuous Integration
Developers integrate their work to trunk at least daily. Each integration triggers an
automated build and test process. Broken builds are fixed within minutes.
Migration relevance: Phase 1 – Foundations. CI is the gateway
capability. Without it, none of the pipeline practices in Phase 2 can function. See
Build Automation and
Trunk-Based Development.
Deployment Automation
Deployments are fully automated and can be triggered by anyone on the team. No manual steps
are required between a green pipeline and production.
Migration relevance: Phase 2 – Pipeline. Specifically,
Single Path to Production and
Rollback.
Trunk-Based Development
Developers work in small batches and merge to trunk at least daily. Branches, if used, are
short-lived (less than one day). There are no long-lived feature branches.
Migration relevance: Phase 1 – Trunk-Based Development.
This is one of the first capabilities to establish because it enables CI.
Test Automation
A comprehensive suite of automated tests provides confidence that the software is deployable.
Tests are reliable, fast, and maintained as carefully as production code.
Migration relevance: Phase 1 – Testing Fundamentals.
Also see the Testing reference section for guidance on specific test types.
Test Data Management
Test data is managed in a way that allows automated tests to run independently, repeatably,
and without relying on shared mutable state. Tests can create and clean up their own data.
Migration relevance: Becomes critical during Phase 2 when you need
production-like environments and deterministic pipeline results.
Shift Left on Security
Security is integrated into the development process rather than added as a gate at the end.
Automated security checks run in the pipeline. Security requirements are part of the
definition of deployable.
Migration relevance: Integrated during Phase 2 – Pipeline Architecture
as automated quality gates rather than manual review steps.
Architecture Capabilities
These capabilities address the structural characteristics of your system that enable or prevent
independent, frequent deployment.
Loosely Coupled Architecture
Teams can deploy their services independently without coordinating with other teams. Changes
to one service do not require changes to other services. APIs have well-defined contracts.
Migration relevance: Phase 3 – Architecture Decoupling.
This capability becomes critical when optimizing for deployment frequency and small batch
sizes.
Empowered Teams
Teams choose their own tools, technologies, and approaches within organizational guardrails.
They do not need approval from a central architecture board for implementation decisions.
Migration relevance: All phases. Teams that cannot make local decisions about their
pipeline, test strategy, or deployment approach will be unable to iterate quickly enough
to make progress.
Product and Process Capabilities
These capabilities address how work is planned, prioritized, and delivered.
Customer Feedback
Product decisions are informed by direct feedback from customers. Teams can observe how
features are used in production and adjust accordingly.
Migration relevance: Becomes fully enabled in Phase 4 – Deliver on Demand
when every change reaches production quickly enough for real customer feedback to inform
the next change.
Value Stream Visibility
The team has a clear view of the entire delivery process from request to production, including
wait times, handoffs, and rework loops.
Migration relevance: Phase 0 – Value Stream Mapping.
This is the first activity in the migration because it informs every decision that follows.
Working in Small Batches
Work is broken down into small increments that can be completed, tested, and deployed
independently. Each increment delivers measurable value or validated learning.
Migration relevance: Begins in Phase 1 – Work Decomposition
and is optimized in Phase 3 – Small Batches.
Team Experimentation
Teams can try new ideas, tools, and approaches without requiring approval through a lengthy
process. Failed experiments are treated as learning, not as waste.
Migration relevance: All phases. The migration itself is an experiment. Teams need the
psychological safety and organizational support to try new practices, fail occasionally, and
adjust.
Lean Management Capabilities
These capabilities address how work is managed, measured, and improved.
Limit Work in Progress
Teams have explicit WIP limits that constrain the number of items in any stage of the delivery
process. WIP limits are enforced and respected.
Migration relevance: Phase 3 – Limiting WIP. Reducing WIP
is one of the most effective ways to improve lead time and delivery predictability.
Visual Management
The state of all work is visible to the entire team through dashboards, boards, or other
visual tools. Anyone can see what is in progress, what is blocked, and what has been deployed.
Migration relevance: All phases. Visual management supports the identification of
constraints in Phase 0 and the enforcement of WIP limits in Phase 3.
Monitoring and Observability
Teams have access to production metrics, logs, and traces that allow them to understand system
behavior, detect issues, and diagnose problems quickly.
Migration relevance: Critical for Phase 4 – Progressive Rollout
where automated health checks determine whether a deployment proceeds or rolls back. Also
supports fast mean time to restore.
Proactive Notification
Teams are alerted to problems before customers are affected. Monitoring thresholds and
anomaly detection trigger notifications that enable rapid response.
Migration relevance: Becomes critical in Phase 4 when deployments are continuous and
automated. Proactive notification is what makes continuous deployment safe.
Cultural Capabilities
These capabilities address the human and organizational conditions that enable high performance.
Generative Culture
Following Ron Westrum’s organizational typology, a generative culture is characterized by
high cooperation, shared risk, and a focus on the mission. Messengers are not punished.
Failures are treated as learning opportunities. New ideas are welcomed.
Migration relevance: All phases. A generative culture is not a phase you implement – it
is a condition you cultivate continuously. Teams in pathological or bureaucratic cultures will
struggle with every phase of the migration because practices like TBD and CI require trust
and psychological safety.
Learning Culture
The organization invests in learning. Teams have time for experimentation, training, and
conference attendance. Knowledge is shared across teams.
Migration relevance: All phases. The CD migration is a learning journey. Teams need time
and space to learn new practices, make mistakes, and improve.
Collaboration Among Teams
Development, operations, security, and product teams work together rather than in silos.
Handoffs are minimized. Shared responsibility replaces blame.
Migration relevance: All phases, but especially Phase 2 – Pipeline
where the pipeline must encode the quality criteria from all disciplines (security, testing,
operations) into automated gates.
Job Satisfaction
Team members find their work meaningful and have the autonomy and resources to do it well.
High job satisfaction predicts high delivery performance (the relationship is bidirectional).
Migration relevance: The migration itself should improve job satisfaction by reducing
toil, eliminating painful manual processes, and giving teams faster feedback on their work.
If the migration is experienced as a burden rather than an improvement, something is wrong
with the approach.
Transformational Leadership
Leaders support the migration with vision, resources, and organizational air cover. They
remove impediments, set direction, and create the conditions for teams to succeed without
micromanaging the details.
Migration relevance: All phases. Without leadership support, the migration will stall
when it encounters the first organizational blocker (budget for tools, policy changes for
deployment processes, cross-team coordination).
Capability Maturity by Phase
The following table maps each DORA capability to the migration phase where it is most actively
developed:
| Capability | Phase 0 | Phase 1 | Phase 2 | Phase 3 | Phase 4 |
|---|---|---|---|---|---|
| Version control | Prerequisite | | | | |
| Continuous integration | | Primary | | | |
| Deployment automation | | | Primary | | |
| Trunk-based development | | Primary | | | |
| Test automation | | Primary | Expanded | | |
| Test data management | | | Primary | | |
| Shift left on security | | | Primary | | |
| Loosely coupled architecture | | | | Primary | |
| Empowered teams | Ongoing | Ongoing | Ongoing | Ongoing | Ongoing |
| Customer feedback | | | | | Primary |
| Value stream visibility | Primary | | | Revisited | |
| Working in small batches | | Started | | Primary | |
| Team experimentation | Ongoing | Ongoing | Ongoing | Ongoing | Ongoing |
| Limit WIP | | | | Primary | |
| Visual management | Started | Ongoing | Ongoing | Ongoing | Ongoing |
| Monitoring and observability | | | Started | Expanded | Primary |
| Proactive notification | | | | | Primary |
| Generative culture | Ongoing | Ongoing | Ongoing | Ongoing | Ongoing |
| Learning culture | Ongoing | Ongoing | Ongoing | Ongoing | Ongoing |
| Collaboration among teams | | Started | Primary | | |
| Job satisfaction | Ongoing | Ongoing | Ongoing | Ongoing | Ongoing |
| Transformational leadership | Ongoing | Ongoing | Ongoing | Ongoing | Ongoing |
Using This Table
“Primary” means the phase where the capability is the main focus of improvement work.
“Ongoing” means the capability is relevant in every phase and should be continuously
nurtured. “Started” or “Expanded” means the capability is introduced or deepened in that
phase. No entry means the capability is not a primary concern in that phase, though it may
still be relevant.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.5 - Resources
Books, videos, and further reading on continuous delivery and deployment.
Adapted from MinimumCD.org
This page collects the books, websites, and videos that inform the practices in this migration
guide. Resources are organized by topic and annotated with which migration phase they are most
relevant to.
Books
Continuous Delivery and Deployment
- Continuous Delivery Pipelines by Dave Farley
- A practical, focused guide to building CD pipelines. Farley covers pipeline design, testing
strategies, and deployment patterns in a direct, implementation-oriented style. Start here
if you want a concise guide to the pipeline practices in Phase 2.
- Most relevant to: Phase 2 – Pipeline
- Continuous Delivery by Jez Humble and Dave Farley
- The foundational text on CD. Published in 2010, it remains the most comprehensive treatment
of the principles and practices that make continuous delivery work. Covers version control
patterns, build automation, testing strategies, deployment pipelines, and release management.
If you read one book before starting your migration, read this one.
- Most relevant to: All phases
- Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim
- Presents the DORA research findings that link technical practices to organizational
performance. Covers the four key metrics (deployment frequency, lead time, change failure
rate, MTTR) and the capabilities that predict high performance. Essential reading for anyone
who needs to make the business case for a CD migration.
- Most relevant to: Phase 0 – Assess and Phase 3 – Metrics-Driven Improvement
- Engineering the Digital Transformation by Gary Gruver
- Addresses the organizational and leadership challenges of large-scale delivery
transformation. Gruver draws on his experience leading transformations at HP and other large
enterprises. Particularly valuable for leaders sponsoring a migration who need to understand
the change management, communication, and sequencing challenges ahead.
- Most relevant to: Organizational leadership across all phases
- Release It! by Michael T. Nygard
- Covers the design and architecture patterns that make production systems resilient. Topics
include stability patterns (circuit breakers, bulkheads, timeouts), deployment patterns, and
the operational realities of running software at scale. Essential reading before entering
Phase 4, where the team has the capability to deploy any change on demand.
- Most relevant to: Phase 4 – Deliver on Demand and Phase 2 – Rollback
- The DevOps Handbook by Gene Kim, Jez Humble, Patrick Debois, and John Willis
- A practical companion to The Phoenix Project. Covers the Three Ways (flow, feedback, and
continuous learning) and provides detailed guidance on implementing DevOps practices. Useful
as a reference throughout the migration.
- Most relevant to: All phases
- The Phoenix Project by Gene Kim, Kevin Behr, and George Spafford
- A novel that illustrates DevOps principles through the story of a fictional IT organization
in crisis. Useful for building organizational understanding of why delivery improvement
matters, especially for stakeholders who will not read a technical book.
- Most relevant to: Building organizational buy-in during Phase 0
Testing
- Growing Object-Oriented Software, Guided by Tests by Steve Freeman and Nat Pryce
- The definitive guide to test-driven development in practice. Goes beyond unit testing to
cover acceptance testing, test doubles, and how TDD drives design. Essential reading for
Phase 1 testing fundamentals.
- Most relevant to: Phase 1 – Testing Fundamentals
- Working Effectively with Legacy Code by Michael Feathers
- Practical techniques for adding tests to untested code, breaking dependencies, and
incrementally improving code that was not designed for testability. Indispensable if your
migration starts with a codebase that has little or no automated testing.
- Most relevant to: Phase 1 – Testing Fundamentals
Work Decomposition and Flow
- User Story Mapping by Jeff Patton
- A practical guide to breaking features into deliverable increments using story maps. Patton’s
approach directly supports the vertical slicing discipline required for small batch delivery.
- Most relevant to: Phase 1 – Work Decomposition
- The Principles of Product Development Flow by Donald Reinertsen
- A rigorous treatment of flow economics in product development. Covers queue theory, batch
size economics, WIP limits, and the cost of delay. Dense but transformative. Reading this
book will change how you think about every aspect of your delivery process.
- Most relevant to: Phase 3 – Optimize
- Making Work Visible by Dominica DeGrandis
- Focuses on identifying and eliminating the “time thieves” that steal productivity: too much
WIP, unknown dependencies, unplanned work, conflicting priorities, and neglected work. A
practical companion to the WIP limiting practices in Phase 3.
- Most relevant to: Phase 3 – Limiting WIP
Architecture
- Building Microservices by Sam Newman
- Covers the architectural patterns that enable independent deployment, including service
boundaries, API design, data management, and testing strategies for distributed systems.
- Most relevant to: Phase 3 – Architecture Decoupling
- Team Topologies by Matthew Skelton and Manuel Pais
- Addresses the relationship between team structure and software architecture (Conway’s Law in
practice). Covers team types, interaction modes, and how to evolve team structures to support
fast flow. Valuable for addressing the organizational blockers that surface throughout the
migration.
- Most relevant to: Organizational design across all phases
Websites
- MinimumCD.org
- Defines the minimum set of practices required to claim you are doing continuous delivery.
This migration guide uses the MinimumCD definition as its target state. Start here to
understand what CD actually requires.
- Dojo Consortium
- A community-maintained collection of CD practices, metrics definitions, and improvement
patterns. Many of the definitions and frameworks in this guide are adapted from the Dojo
Consortium’s work.
- DORA (dora.dev)
- The DevOps Research and Assessment site, which publishes the annual State of DevOps report
and provides resources for measuring and improving delivery performance.
- Trunk-Based Development
- The comprehensive reference for trunk-based development patterns. Covers short-lived
feature branches, feature flags, branch by abstraction, and release branching strategies.
- Martin Fowler’s blog (martinfowler.com)
- Martin Fowler’s site contains authoritative articles on continuous integration, continuous
delivery, microservices, refactoring, and software design. Key articles include
“Continuous Integration” and “Continuous Delivery.”
- Google Cloud Architecture Center – DevOps
- Google’s public documentation of the DORA capabilities, including self-assessment tools and
implementation guidance.
Videos
- “Continuous Delivery” by Dave Farley (YouTube channel)
- Dave Farley’s YouTube channel provides weekly videos covering CD practices, pipeline design,
testing strategies, and software engineering principles. Accessible and practical.
- Most relevant to: All phases
- “Continuous Delivery” by Jez Humble (various conference talks)
- Jez Humble’s conference presentations cover the principles and research behind CD. His talk
“Why Continuous Delivery?” is an excellent introduction for teams and stakeholders who are
new to the concept.
- Most relevant to: Building understanding during Phase 0
- “Refactoring” and “TDD” talks by Martin Fowler and Kent Beck
- Foundational talks on the development practices that support CD. Understanding TDD and
refactoring is essential for Phase 1 testing fundamentals.
- Most relevant to: Phase 1 – Foundations
- “The Smallest Thing That Could Possibly Work” by Bryan Finster
- Covers the work decomposition and small batch delivery practices that are central to this
migration guide. Focuses on practical techniques for breaking work into vertical slices.
- Most relevant to: Phase 1 – Work Decomposition and Phase 3 – Small Batches
Recommended Reading Order
If you are starting your migration and want to read in the most useful order:
- Accelerate – to understand the research and build the business case
- Continuous Delivery (Humble & Farley) – to understand the full picture
- Continuous Delivery Pipelines (Farley) – for practical pipeline implementation
- Working Effectively with Legacy Code – if your codebase lacks tests
- The Principles of Product Development Flow – to understand flow optimization
- Release It! – before moving to continuous deployment
Migration Tip
You do not need to read all of these before starting your migration. Start with the practices
in Phase 1, read Accelerate for the business case, and refer to the other resources as you
reach the relevant migration phase. The most important thing is to start delivering
improvements, not to finish a reading list.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
8.6 - Metrics
Detailed definitions for key delivery metrics. Understand what to measure and why.
These metrics help you assess your current delivery performance and track improvement
over time. Start with the metrics most relevant to your current phase.
Key Metrics
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.6.1 - Integration Frequency
How often developers integrate code changes to the trunk – a leading indicator of CI maturity and small batch delivery.
Definition
Integration Frequency measures the average number of production-ready pull requests
a team merges to trunk per day, normalized by team size. On a team of five
developers, healthy continuous integration practice produces at least five
integrations per day – roughly one per developer.
This metric is a direct indicator of how well a team practices
Continuous Integration.
Teams that integrate frequently work in small batches, receive fast feedback, and
reduce the risk associated with large, infrequent merges.
A value of 1.0 or higher per developer per day indicates that work is being
decomposed into small, independently deliverable increments.
How to Measure
- Count trunk merges. Track the number of pull requests (or direct commits)
merged to
main or trunk each day.
- Normalize by team size. Divide the daily count by the number of developers
actively contributing that day.
- Calculate the rolling average. Use a 5-day or 10-day rolling window to
smooth daily variation and surface meaningful trends.
Most source control platforms expose this data through their APIs:
- GitHub – list merged pull requests via the REST or GraphQL API.
- GitLab – query merged merge requests per project.
- Bitbucket – use the pull request activity endpoint.
Alternatively, count commits to the default branch if pull requests are not used.
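The counting and normalization steps above can be scripted. A minimal sketch, assuming a GitHub-hosted repository and Node 18+ (the repository coordinates, team size, and window are placeholders, not part of the source guidance):

```javascript
// Count PRs merged to trunk per day and normalize by team size.
// OWNER, REPO, TEAM_SIZE, and DAYS are placeholders for your own values.
const OWNER = "my-org";
const REPO = "my-service";
const TEAM_SIZE = 5;
const DAYS = 10;

async function mergedPullRequests() {
  const url = `https://api.github.com/repos/${OWNER}/${REPO}/pulls` +
    `?state=closed&base=main&per_page=100`;
  const res = await fetch(url, {
    headers: { Authorization: `Bearer ${process.env.GITHUB_TOKEN}` },
  });
  const pulls = await res.json();
  // Keep only PRs that were actually merged, not merely closed.
  return pulls.filter((pr) => pr.merged_at !== null);
}

(async () => {
  const since = Date.now() - DAYS * 24 * 60 * 60 * 1000;
  const recent = (await mergedPullRequests()).filter(
    (pr) => new Date(pr.merged_at).getTime() >= since
  );
  const perDay = recent.length / DAYS;
  console.log(`Integrations per day: ${perDay.toFixed(2)}`);
  console.log(`Per developer per day: ${(perDay / TEAM_SIZE).toFixed(2)} (target: >= 1.0)`);
})();
```

The sketch uses a fixed team size for simplicity; normalizing by the contributors active on a given day, as step 2 suggests, would also require commit author data.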
Targets
| Level |
Integration Frequency (per developer per day) |
| Low |
Less than 1 per week |
| Medium |
A few times per week |
| High |
Once per day |
| Elite |
Multiple times per day |
The elite target aligns with trunk-based development, where developers push small
changes to the trunk multiple times daily and rely on automated testing and feature
flags to manage risk.
Common Pitfalls
- Meaningless commits. Teams may inflate the count by integrating trivial or
empty changes. Pair this metric with code review quality and defect rate.
- Breaking the trunk. Pushing faster without adequate test coverage leads to a
red build and slows the entire team. Always pair Integration Frequency with build
success rate and Change Fail Rate.
- Counting the wrong thing. Merges to long-lived feature branches do not count.
Only merges to the trunk or main integration branch reflect true CI practice.
- Ignoring quality. If defect rates rise as integration
frequency increases, the team is skipping quality steps. Use defect rate as a
guardrail metric.
Connection to CD
Integration Frequency is the foundational metric for Continuous Delivery. Without
frequent integration, every downstream metric suffers:
- Smaller batches reduce risk. Each integration carries less change, making
failures easier to diagnose and fix.
- Faster feedback loops. Frequent integration means the CI pipeline runs more
often, catching issues within minutes instead of days.
- Enables trunk-based development. High integration frequency is incompatible
with long-lived branches. Teams naturally move toward short-lived branches or
direct trunk commits.
- Reduces merge conflicts. The longer code stays on a branch, the more likely
it diverges from trunk. Frequent integration keeps the delta small.
- Prerequisite for deployment frequency. You cannot deploy more often than you
integrate. Improving this metric directly unblocks improvements to
Release Frequency.
To improve Integration Frequency:
- Decompose stories into smaller increments using
Behavior-Driven Development.
- Use Test-Driven Development to produce modular, independently testable code.
- Adopt feature flags or branch by abstraction to decouple integration from release.
- Practice Trunk-Based Development with
short-lived branches lasting less than one day.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.6.2 - Build Duration
Time from code commit to a deployable artifact – a critical constraint on feedback speed and mean time to repair.
Definition
Build Duration measures the elapsed time from when a developer pushes a commit
until the CI pipeline produces a deployable artifact and all automated quality
gates have passed. This includes compilation, unit tests, integration tests, static
analysis, security scans, and artifact packaging.
Build Duration represents the minimum possible time between deciding to make a
change and having that change ready for production. It sets a hard floor on
Lead Time and directly constrains how quickly a team can
respond to production incidents.
This metric is sometimes referred to as “pipeline cycle time” or “CI cycle time.”
The book Accelerate references it as part of “hard lead time.”
How to Measure
- Record the commit timestamp. Capture when the commit arrives at the CI
server (webhook receipt or pipeline trigger time).
- Record the artifact-ready timestamp. Capture when the final pipeline stage
completes successfully and the deployable artifact is published.
- Calculate the difference. Subtract the commit timestamp from the
artifact-ready timestamp.
- Track the median and p95. The median shows typical performance. The 95th
percentile reveals worst-case builds that block developers.
Most CI platforms expose build duration natively:
- GitHub Actions –
createdAt and updatedAt on workflow runs.
- GitLab CI – pipeline
created_at and finished_at.
- Jenkins – build start time and duration fields.
- CircleCI – workflow duration in the Insights dashboard.
Set up alerts when builds exceed your target threshold so the team can investigate
regressions immediately.
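A minimal sketch of steps 3 and 4, assuming GitHub Actions and Node 18+ (the repository coordinates are placeholders, and the percentile uses a simple nearest-rank method):

```javascript
// Compute median and p95 wall-clock build duration from recent workflow runs.
// Uses created_at (queued) and updated_at (finished) so queue time is included.
const OWNER = "my-org";
const REPO = "my-service";

function percentile(sortedValues, p) {
  // Nearest-rank percentile on an ascending-sorted array.
  const idx = Math.ceil((p / 100) * sortedValues.length) - 1;
  return sortedValues[Math.max(0, idx)];
}

(async () => {
  const url = `https://api.github.com/repos/${OWNER}/${REPO}/actions/runs` +
    `?status=success&per_page=100`;
  const res = await fetch(url, {
    headers: { Authorization: `Bearer ${process.env.GITHUB_TOKEN}` },
  });
  const { workflow_runs } = await res.json();

  const minutes = workflow_runs
    .map((run) => (new Date(run.updated_at) - new Date(run.created_at)) / 60000)
    .sort((a, b) => a - b);

  console.log(`Median build: ${percentile(minutes, 50).toFixed(1)} min`);
  console.log(`p95 build:    ${percentile(minutes, 95).toFixed(1)} min`);
})();
```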
Targets
| Level | Build Duration |
|---|---|
| Low | More than 30 minutes |
| Medium | 10 – 30 minutes |
| High | 5 – 10 minutes |
| Elite | Less than 5 minutes |
The ten-minute threshold is a widely recognized guideline. Builds longer than ten
minutes break developer flow, discourage frequent integration, and increase the
cost of fixing failures.
Common Pitfalls
- Removing tests to hit targets. Reducing test count or skipping test types
(integration, security) lowers build duration but degrades quality. Always pair
this metric with Change Fail Rate and defect rate.
- Ignoring queue time. If builds wait in a queue before execution, the
developer experiences the queue time as part of the feedback delay even though it
is not technically “build” time. Measure wall-clock time from commit to result.
- Optimizing the wrong stage. Profile the pipeline before optimizing. Often a
single slow test suite or a sequential step that could run in parallel dominates
the total duration.
- Flaky tests. Tests that intermittently fail cause retries, effectively
doubling or tripling build duration. Track flake rate alongside build duration.
Connection to CD
Build Duration is a critical bottleneck in the Continuous Delivery pipeline:
- Constrains Mean Time to Repair. When production is down, the build pipeline
is the minimum time to get a fix deployed. A 30-minute build means at least 30
minutes of downtime for any fix, no matter how small. Reducing build duration
directly improves MTTR.
- Enables frequent integration. Developers are unlikely to integrate multiple
times per day if each integration takes 30 minutes to validate. Short builds
encourage higher Integration Frequency.
- Shortens feedback loops. The sooner a developer learns that a change broke
something, the less context they have lost and the cheaper the fix. Builds under
ten minutes keep developers in flow.
- Supports continuous deployment. Automated deployment pipelines cannot deliver
changes rapidly if the build stage is slow. Build duration is often the largest
component of Lead Time.
To improve Build Duration:
- Parallelize stages. Run unit tests, linting, and security scans concurrently
rather than sequentially.
- Replace slow end-to-end tests. Move heavyweight end-to-end tests to an
asynchronous post-deploy verification stage. Use contract tests and service
virtualization in the main pipeline.
- Decompose large services. Smaller codebases compile and test faster. If build
duration is stubbornly high, consider breaking the service into smaller domains.
- Cache aggressively. Cache dependencies, Docker layers, and compilation
artifacts between builds.
- Set a build time budget. Alert the team whenever a new test or step pushes
the build past your target, so test efficiency is continuously maintained.
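As an illustration of the parallelization and caching points above, a minimal GitHub Actions sketch (the job names, commands, and npm toolchain are placeholders, not prescriptions):

```yaml
# Unit tests, lint, and security scan run as independent parallel jobs; each
# restores a cached dependency install to avoid repeating the slowest setup step.
name: ci
on: push

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm        # caches ~/.npm between runs
      - run: npm ci
      - run: npm test

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npm run lint

  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm audit --audit-level=high
```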
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.6.3 - Development Cycle Time
Average time from when work starts until it is running in production – a key flow metric for identifying delivery bottlenecks.
Definition
Development Cycle Time measures the elapsed time from when a developer begins work
on a story or task until that work is deployed to production and available to users.
It captures the full construction phase of delivery: coding, code review, testing,
integration, and deployment.
This is distinct from Lead Time, which includes the time a request
spends waiting in the backlog before work begins. Development Cycle Time focuses
exclusively on the active delivery phase.
The Accelerate research uses “lead time for changes” (measured from commit to
production) as a key DORA metric. Development Cycle Time extends this slightly
further back to when work starts, capturing the full development process including
any time between starting work and the first commit.
How to Measure
- Record when work starts. Capture the timestamp when a story moves to
“In Progress” in your issue tracker, or when the first commit for the story
appears.
- Record when work reaches production. Capture the timestamp of the
production deployment that includes the completed story.
- Calculate the difference. Subtract the start time from the production
deploy time.
- Report the median and distribution. The median provides a typical value.
The distribution (or a control chart) reveals variability and outliers that
indicate process problems.
Sources for this data include:
- Issue trackers (Jira, GitHub Issues, Azure Boards) – status transition
timestamps.
- Source control – first commit timestamp associated with a story.
- Deployment logs – timestamp of production deployments linked to stories.
Linking stories to deployments is essential. Use commit message conventions (e.g.,
story IDs in commit messages) or deployment metadata to create this connection.
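A minimal sketch of that linking step, assuming story IDs such as ABC-123 appear in commit messages and start times come from your tracker (the ID pattern and data shapes are illustrative):

```javascript
// Extract story IDs from the commit messages included in a deployment,
// then compute cycle time as (deploy time - story start time).
const STORY_ID = /\b[A-Z]{2,}-\d+\b/g; // e.g. "ABC-123"; adjust to your tracker

function cycleTimesForDeploy(deploy, storyStartTimes) {
  // deploy: { deployedAt: ISO string, commitMessages: [string, ...] }
  // storyStartTimes: { "ABC-123": ISO string, ... } exported from the tracker
  const ids = new Set(
    deploy.commitMessages.flatMap((msg) => msg.match(STORY_ID) ?? [])
  );
  const deployedAt = new Date(deploy.deployedAt);
  return [...ids]
    .filter((id) => storyStartTimes[id])
    .map((id) => ({
      story: id,
      days: (deployedAt - new Date(storyStartTimes[id])) / 86_400_000,
    }));
}

// Example usage with made-up data:
console.log(
  cycleTimesForDeploy(
    {
      deployedAt: "2024-05-03T16:30:00Z",
      commitMessages: ["ABC-101 add price rules", "ABC-102 fix rounding"],
    },
    { "ABC-101": "2024-05-02T09:00:00Z", "ABC-102": "2024-05-01T10:00:00Z" }
  )
);
```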
Targets
| Level | Development Cycle Time |
|---|---|
| Low | More than 2 weeks |
| Medium | 1 – 2 weeks |
| High | 2 – 7 days |
| Elite | Less than 2 days |
Elite teams deliver completed work to production within one to two days of starting
it. This is achievable only when work is decomposed into small increments, the
pipeline is fast, and deployment is automated.
Common Pitfalls
- Marking work “Done” before it reaches production. If “Done” means “code
complete” rather than “deployed,” the metric understates actual cycle time. The
Definition of Done must include production deployment.
- Skipping the backlog. Moving items from “Backlog” directly to “Done” after
deploying hides the true wait time and development duration. Ensure stories pass
through the standard workflow stages.
- Splitting work into functional tasks. Breaking a story into separate
“development,” “testing,” and “deployment” tasks obscures the end-to-end cycle
time. Measure at the story or feature level.
- Ignoring variability. A low average can hide a bimodal distribution where
some stories take hours and others take weeks. Use a control chart or histogram
to expose the full picture.
- Optimizing for speed without quality. If cycle time drops but
Change Fail Rate rises, the team is cutting corners.
Use quality metrics as guardrails.
Connection to CD
Development Cycle Time is the most comprehensive measure of delivery flow and sits
at the heart of Continuous Delivery:
- Exposes bottlenecks. A long cycle time reveals where work gets stuck –
waiting for code review, queued for testing, blocked by a manual approval, or
delayed by a slow pipeline. Each bottleneck is a target for improvement.
- Drives smaller batches. The only way to achieve a cycle time under two days
is to decompose work into very small increments. This naturally leads to smaller
changes, less risk, and faster feedback.
- Reduces waste from changing priorities. Long cycle times mean work in progress
is exposed to priority changes, context switches, and scope creep. Shorter cycles
reduce the window of vulnerability.
- Improves feedback quality. The sooner a change reaches production, the sooner
the team gets real user feedback. Short cycle times enable rapid learning and
course correction.
- Subsumes other metrics. Cycle time is affected by Integration
Frequency, Build Duration,
and Work in Progress. Improving any of these upstream
metrics will reduce cycle time.
To improve Development Cycle Time:
- Decompose work into stories that can be completed and deployed within one to two
days.
- Remove handoffs between teams (e.g., separate dev and QA teams).
- Automate the build and deploy pipeline to eliminate manual steps.
- Improve test design so the pipeline runs faster without sacrificing coverage.
- Limit Work in Progress so the team focuses on finishing
work rather than starting new items.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.6.4 - Lead Time
Total time from when a change is committed until it is running in production – a DORA key metric for delivery throughput.
Definition
Lead Time measures the total elapsed time from when a code change is committed to
the version control system until that change is successfully running in production.
This is one of the four key metrics identified by the DORA (DevOps Research and
Assessment) team as a predictor of software delivery performance.
In the broader value stream, “lead time” can also refer to the time from a customer
request to delivery. The DORA definition focuses specifically on the segment from
commit to production, which the Accelerate research calls “lead time for changes.”
This narrower definition captures the efficiency of your delivery pipeline and
deployment process.
Lead Time includes Build Duration plus any additional time
for deployment, approval gates, environment provisioning, and post-deploy
verification. It is a superset of build time and a subset of
Development Cycle Time, which also includes the
coding phase before the first commit.
How to Measure
- Record the commit timestamp. Use the timestamp of the commit as recorded in
source control (not the local author timestamp, but the time it was pushed or
merged to the trunk).
- Record the production deployment timestamp. Capture when the deployment
containing that commit completes successfully in production.
- Calculate the difference. Subtract the commit time from the deploy time.
- Aggregate across commits. Report the median lead time across all commits
deployed in a given period (daily, weekly, or per release).
Data sources:
- Source control – commit or merge timestamps from Git, GitHub, GitLab, etc.
- CI/CD platform – pipeline completion times from Jenkins, GitHub Actions,
GitLab CI, etc.
- Deployment tooling – production deployment timestamps from Argo CD, Spinnaker,
Flux, or custom scripts.
For teams practicing continuous deployment, lead time may be nearly identical to
build duration. For teams with manual approval gates or scheduled release windows,
lead time will be significantly longer.
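A minimal sketch, assuming each production deployment is marked with a git tag so the commits it contains can be listed (the tag names and deploy timestamp are placeholders):

```javascript
// Median lead time for one deployment: time from each commit on trunk to the
// production deploy that first contained it.
const { execSync } = require("node:child_process");

function leadTimesForDeploy(prevTag, tag, deployFinishedAt) {
  // %cI = committer date, strict ISO 8601 (approximates merge-to-trunk time).
  const out = execSync(`git log --pretty=%cI ${prevTag}..${tag}`, {
    encoding: "utf8",
  });
  const deployed = new Date(deployFinishedAt);
  return out
    .trim()
    .split("\n")
    .filter(Boolean)
    .map((iso) => (deployed - new Date(iso)) / 3_600_000); // hours
}

function median(values) {
  const s = [...values].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Example: placeholder tags and deploy timestamp.
const hours = leadTimesForDeploy("v1.4.0", "v1.4.1", "2024-05-03T17:05:00Z");
console.log(`Median lead time: ${median(hours).toFixed(1)} h across ${hours.length} commits`);
```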
Targets
| Level | Lead Time for Changes |
|---|---|
| Low | More than 6 months |
| Medium | 1 – 6 months |
| High | 1 day – 1 week |
| Elite | Less than 1 hour |
These levels are drawn from the DORA State of DevOps research. Elite performers
deliver changes to production in under an hour from commit, enabled by fully
automated pipelines and continuous deployment.
Common Pitfalls
- Measuring only build time. Lead time includes everything after the commit,
not just the CI pipeline. Manual approval gates, scheduled deployment windows,
and environment provisioning delays must all be included.
- Ignoring waiting time. A change may sit in a queue waiting for a release
train, a change advisory board (CAB) review, or a deployment window. This wait
time is part of lead time and often dominates the total.
- Tracking requests instead of commits. Some teams measure from customer request
to delivery. While valuable, this conflates backlog prioritization with delivery
efficiency. Keep this metric focused on the commit-to-production segment.
- Hiding items from the backlog. Requests tracked in spreadsheets or side
channels before entering the backlog distort lead time measurements. Ensure all
work enters the system of record promptly.
- Reducing quality to reduce lead time. Shortening approval processes or
skipping test stages reduces lead time at the cost of quality. Pair this metric
with Change Fail Rate as a guardrail.
Connection to CD
Lead Time is one of the four DORA metrics and a direct measure of your delivery
pipeline’s end-to-end efficiency:
- Reveals pipeline bottlenecks. A large gap between build duration and lead time
points to manual processes, approval queues, or deployment delays that the team
can target for automation.
- Measures the cost of failure recovery. When production breaks, lead time is
the minimum time to deliver a fix (unless you roll back). This makes lead time
a direct input to Mean Time to Repair.
- Drives automation. The primary way to reduce lead time is to automate every
step between commit and production: build, test, security scanning, environment
provisioning, deployment, and verification.
- Reflects deployment strategy. Teams using continuous deployment have lead
times measured in minutes. Teams using weekly release trains have lead times
measured in days. The metric makes the cost of batching visible.
- Connects speed and stability. The DORA research shows that elite performers
achieve both low lead time and low Change Fail Rate.
Speed and quality are not trade-offs – they reinforce each other when the
delivery system is well-designed.
To improve Lead Time:
- Automate the deployment pipeline end to end, eliminating manual gates.
- Replace change advisory board (CAB) reviews with automated policy checks and
peer review.
- Deploy on every successful build rather than batching changes into release trains.
- Reduce Build Duration to shrink the largest component of
lead time.
- Monitor and eliminate environment provisioning delays.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.6.5 - Change Fail Rate
Percentage of production deployments that cause a failure or require remediation – a DORA key metric for delivery stability.
Definition
Change Fail Rate measures the percentage of deployments to production that result
in degraded service, negative customer impact, or require immediate remediation
such as a rollback, hotfix, or patch.
A “failed change” includes any deployment that:
- Is rolled back.
- Requires a hotfix deployed within a short window (commonly 24 hours).
- Triggers a production incident attributed to the change.
- Requires manual intervention to restore service.
This is one of the four DORA key metrics. It measures the stability side of
delivery performance, complementing the throughput metrics of
Lead Time and Release Frequency.
How to Measure
- Count total production deployments over a defined period (weekly, monthly).
- Count deployments classified as failures using the criteria above.
- Divide failures by total deployments and express as a percentage.
Data sources:
- Deployment logs – total deployment count from your CD platform.
- Incident management – incidents linked to specific deployments (PagerDuty,
Opsgenie, ServiceNow).
- Rollback records – deployments that were reverted, either manually or by
automated rollback.
- Hotfix tracking – deployments tagged as hotfixes or emergency changes.
Automate the classification where possible. For example, if a deployment is
followed by another deployment of the same service within a defined window (e.g.,
one hour), flag the original as a potential failure for review.
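A minimal sketch of that flagging heuristic (the one-hour window and the deploy record shape are illustrative; flagged deployments still need human review):

```javascript
// Flag deployments that were followed by another deploy of the same service
// within a short window - candidates for "failed change" classification.
const WINDOW_MS = 60 * 60 * 1000; // 1 hour; tune to your remediation patterns

function flagSuspectDeploys(deploys) {
  // deploys: [{ service, version, at: ISO string }, ...]
  const byService = new Map();
  for (const d of deploys) {
    if (!byService.has(d.service)) byService.set(d.service, []);
    byService.get(d.service).push(d);
  }
  const suspects = [];
  for (const list of byService.values()) {
    list.sort((a, b) => new Date(a.at) - new Date(b.at));
    for (let i = 0; i < list.length - 1; i++) {
      const gap = new Date(list[i + 1].at) - new Date(list[i].at);
      if (gap <= WINDOW_MS) suspects.push(list[i]);
    }
  }
  return suspects; // each one is a rollback, a hotfix target, or a false positive
}

console.log(
  flagSuspectDeploys([
    { service: "checkout", version: "1.2.3", at: "2024-05-01T10:00:00Z" },
    { service: "checkout", version: "1.2.4", at: "2024-05-01T10:25:00Z" },
  ])
); // flags version 1.2.3 for review
```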
Targets
| Level | Change Fail Rate |
|---|---|
| Low | 46 – 60% |
| Medium | 16 – 45% |
| High | 0 – 15% |
| Elite | 0 – 5% |
These levels are drawn from the DORA State of DevOps research. Elite performers
maintain a change fail rate below 5%, meaning fewer than 1 in 20 deployments causes
a problem.
Common Pitfalls
- Not recording failures. Deploying fixes without logging the original failure
understates the true rate. Ensure every incident and rollback is tracked.
- Reclassifying defects. Creating review processes that reclassify production
defects as “feature requests” or “known limitations” hides real failures.
- Inflating deployment count. Re-deploying the same working version to increase
the denominator artificially lowers the rate. Only count deployments that contain
new changes.
- Pursuing zero defects at the cost of speed. An obsessive focus on eliminating
all failures can slow Release Frequency to a crawl. A
small failure rate with fast recovery is preferable to near-zero failures with
monthly deployments.
- Ignoring near-misses. Changes that cause degraded performance but do not
trigger a full incident are still failures. Define clear criteria for what
constitutes a failed change and apply them consistently.
Connection to CD
Change Fail Rate is the primary quality signal in a Continuous Delivery pipeline:
- Validates pipeline quality gates. A rising change fail rate indicates that
the automated tests, security scans, and quality checks in the pipeline are not
catching enough defects. Each failure is an opportunity to add or improve a
quality gate.
- Enables confidence in frequent releases. Teams will only deploy frequently
if they trust the pipeline. A low change fail rate builds this trust and
supports higher Release Frequency.
- Smaller changes fail less. The DORA research consistently shows that smaller,
more frequent deployments have lower failure rates than large, infrequent
releases. Improving Integration Frequency naturally
improves this metric.
- Drives root cause analysis. Each failed change should trigger a blameless
investigation: what automated check could have caught this? The answers feed
directly into pipeline improvements.
- Balances throughput metrics. Change Fail Rate is the essential guardrail for
Lead Time and Release Frequency. If
those metrics improve while change fail rate worsens, the team is trading quality
for speed.
To improve Change Fail Rate:
- Deploy smaller changes more frequently to reduce the blast radius of failures.
- Identify the root cause of each failure and add automated checks to prevent
recurrence.
- Strengthen the test suite, particularly integration and contract tests that
validate interactions between services.
- Implement progressive delivery (canary releases, feature flags) to limit the
impact of defective changes before they reach all users.
- Conduct blameless post-incident reviews and feed learnings back into the
delivery pipeline.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.6.6 - Mean Time to Repair
Average time from when a production incident is detected until service is restored – a DORA key metric for recovery capability.
Definition
Mean Time to Repair (MTTR) measures the average elapsed time between when a
production incident is detected and when it is fully resolved and service is
restored to normal operation.
MTTR reflects an organization’s ability to recover from failure. It encompasses
detection, diagnosis, fix development, build, deployment, and verification. A
short MTTR depends on the entire delivery system working well – fast builds,
automated deployments, good observability, and practiced incident response.
The Accelerate research identifies MTTR as one of the four key DORA metrics and
notes that “software delivery performance is a combination of lead time, release
frequency, and MTTR.” It is the stability counterpart to the throughput metrics.
How to Measure
- Record the detection timestamp. This is when the team first becomes aware of
the incident – typically when an alert fires, a customer reports an issue, or
monitoring detects an anomaly.
- Record the resolution timestamp. This is when the incident is resolved and
service is confirmed to be operating normally. Resolution means the customer
impact has ended, not merely that a fix has been deployed.
- Calculate the duration for each incident.
- Compute the average across all incidents in a given period.
Data sources:
- Incident management platforms – PagerDuty, Opsgenie, ServiceNow, or
Statuspage provide incident lifecycle timestamps.
- Monitoring and alerting – alert trigger times from Datadog, Prometheus
Alertmanager, CloudWatch, or equivalent.
- Deployment logs – timestamps of rollbacks or hotfix deployments.
Report both the mean and the median. The mean can be skewed by a single long
outage, so the median gives a better sense of typical recovery time. Also track
the maximum MTTR per period to highlight worst-case incidents.
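A minimal sketch of that summary, assuming incident records with detection and resolution timestamps (the record shape is illustrative):

```javascript
// Summarize time-to-restore from incident detection/resolution timestamps.
function mttrSummary(incidents) {
  // incidents: [{ detectedAt: ISO string, resolvedAt: ISO string }, ...]
  const hours = incidents
    .map((i) => (new Date(i.resolvedAt) - new Date(i.detectedAt)) / 3_600_000)
    .sort((a, b) => a - b);
  const mean = hours.reduce((sum, h) => sum + h, 0) / hours.length;
  const mid = Math.floor(hours.length / 2);
  const median = hours.length % 2 ? hours[mid] : (hours[mid - 1] + hours[mid]) / 2;
  return { mean, median, max: hours[hours.length - 1] };
}

console.log(
  mttrSummary([
    { detectedAt: "2024-05-01T10:00:00Z", resolvedAt: "2024-05-01T10:40:00Z" },
    { detectedAt: "2024-05-08T02:00:00Z", resolvedAt: "2024-05-08T09:00:00Z" },
  ])
); // the long outage skews the mean; the median stays representative
```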
Targets
| Level | Mean Time to Repair |
|---|---|
| Low | More than 1 week |
| Medium | 1 day – 1 week |
| High | Less than 1 day |
| Elite | Less than 1 hour |
Elite performers restore service in under one hour. This requires automated
rollback or roll-forward capability, fast build pipelines, and well-practiced
incident response processes.
Common Pitfalls
- Closing incidents prematurely. Marking an incident as resolved before the
customer impact has actually ended artificially deflates MTTR. Define “resolved”
clearly and verify that service is truly restored.
- Not counting detection time. If the team discovers a problem informally
(e.g., a developer notices something odd) and fixes it before opening an
incident, the time is not captured. Encourage consistent incident reporting.
- Ignoring recurring incidents. If the same issue keeps reappearing, each
individual MTTR may be short, but the cumulative impact is high. Track recurrence
as a separate quality signal.
- Conflating MTTR with MTTD. Mean Time to Detect (MTTD) and Mean Time to
Repair overlap but are distinct. If you only measure from alert to resolution,
you miss the detection gap – the time between when the problem starts and when
it is detected. Both matter.
- Optimizing MTTR without addressing root causes. Getting faster at fixing
recurring problems is good, but preventing those problems in the first place is
better. Pair MTTR with Change Fail Rate to ensure the
number of incidents is also decreasing.
Connection to CD
MTTR is a direct measure of how well the entire Continuous Delivery system supports
recovery:
- Pipeline speed is the floor. The minimum possible MTTR for a roll-forward
fix is the Build Duration plus deployment time. A 30-minute
build means you cannot restore service via a code fix in less than 30 minutes.
Reducing build duration directly reduces MTTR.
- Automated deployment enables fast recovery. Teams that can deploy with one
click or automatically can roll back or roll forward in minutes. Manual
deployment processes add significant time to every incident.
- Feature flags accelerate mitigation. If a failing change is behind a feature
flag, the team can disable it in seconds without deploying new code. This can
reduce MTTR from minutes to seconds for flag-protected changes.
- Observability shortens detection and diagnosis. Good logging, metrics, and
tracing help the team identify the cause of an incident quickly. Without
observability, diagnosis dominates the repair timeline.
- Practice improves performance. Teams that deploy frequently have more
experience responding to issues. High Release Frequency
correlates with lower MTTR because the team has well-rehearsed recovery
procedures.
- Trunk-based development simplifies rollback. When trunk is always deployable,
the team can roll back to the previous commit. Long-lived branches and complex
merge histories make rollback risky and slow.
To improve MTTR:
- Keep the pipeline always deployable so a fix can be deployed at any time.
- Reduce Build Duration to enable faster roll-forward.
- Implement feature flags for large changes so they can be disabled without
redeployment.
- Invest in observability – structured logging, distributed tracing, and
meaningful alerting.
- Practice incident response regularly, including deploying rollbacks and hotfixes.
- Conduct blameless post-incident reviews and feed learnings back into the pipeline
and monitoring.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.6.7 - Release Frequency
How often changes are deployed to production – a DORA key metric for delivery throughput and team capability.
Definition
Release Frequency (also called Deployment Frequency) measures how often a team
successfully deploys changes to production. It is expressed as deployments per day,
per week, or per month, depending on the team’s current cadence.
This is one of the four DORA key metrics. It measures the throughput side of
delivery performance – how rapidly the team can get completed work into the hands
of users. Higher release frequency enables faster feedback, smaller batch sizes,
and reduced deployment risk.
Each deployment should deliver a meaningful change. Re-deploying the same artifact
or deploying empty changes does not count.
How to Measure
- Count production deployments. Record each successful deployment to the
production environment over a defined period.
- Exclude non-changes. Do not count re-deployments of unchanged artifacts,
infrastructure-only changes (unless relevant), or deployments to non-production
environments.
- Calculate frequency. Divide the count by the time period. Express as
deployments per day (for high performers) or per week/month (for teams earlier
in their journey).
Data sources:
- CD platforms – Argo CD, Spinnaker, Flux, Octopus Deploy, or similar tools
track every deployment.
- CI/CD pipeline logs – GitHub Actions, GitLab CI, Jenkins, and CircleCI
record deployment job executions.
- Cloud provider logs – AWS CodeDeploy, Azure DevOps, GCP Cloud Deploy, and
Kubernetes audit logs.
- Custom deployment scripts – Add a logging line that records the timestamp,
service name, and version to a central log or metrics system.
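A minimal sketch of the custom-script approach, assuming a JSON-lines log that the deploy script appends to (the file name and record shape are illustrative):

```javascript
// Record one JSON line per production deployment, then derive weekly frequency.
const fs = require("node:fs");
const LOG = "deployments.jsonl";

function recordDeployment(service, version) {
  const entry = { at: new Date().toISOString(), service, version };
  fs.appendFileSync(LOG, JSON.stringify(entry) + "\n");
}

function deploysPerWeek(weeks = 4) {
  const since = Date.now() - weeks * 7 * 24 * 60 * 60 * 1000;
  const lines = fs.existsSync(LOG)
    ? fs.readFileSync(LOG, "utf8").trim().split("\n").filter(Boolean)
    : [];
  const recent = lines
    .map((line) => JSON.parse(line))
    .filter((entry) => new Date(entry.at).getTime() >= since);
  return recent.length / weeks;
}

recordDeployment("checkout", "1.2.5"); // call this from your deploy script
console.log(`Deployments per week: ${deploysPerWeek().toFixed(1)}`);
```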
Targets
| Level | Release Frequency |
|---|---|
| Low | Less than once per 6 months |
| Medium | Once per month to once per 6 months |
| High | Once per week to once per month |
| Elite | Multiple times per day |
These levels are drawn from the DORA State of DevOps research. Elite performers
deploy on demand, multiple times per day, with each deployment containing a small
set of changes.
Common Pitfalls
- Counting empty deployments. Re-deploying the same artifact or building
artifacts that contain no changes inflates the metric without delivering value.
Count only deployments with meaningful changes.
- Ignoring failed deployments. If you count deployments that are immediately
rolled back, the frequency looks good but the quality is poor. Pair with
Change Fail Rate to get the full picture.
- Equating frequency with value. Deploying frequently is a means, not an end.
Deploying 10 times a day delivers no value if the changes do not meet user needs.
Release Frequency measures capability, not outcome.
- Batch releasing to hit a target. Combining multiple changes into a single
release to deploy “more often” defeats the purpose. The goal is small, individual
changes flowing through the pipeline independently.
- Focusing on speed without quality. If release frequency increases but
Change Fail Rate also increases, the team is releasing
faster than its quality processes can support. Slow down and improve the pipeline.
Connection to CD
Release Frequency is the ultimate output metric of a Continuous Delivery pipeline:
- Validates the entire delivery system. High release frequency is only possible
when the pipeline is fast, tests are reliable, deployment is automated, and the
team has confidence in the process. It is the end-to-end proof that CD is working.
- Reduces deployment risk. Each deployment carries less change when deployments
are frequent. Less change means less risk, easier rollback, and simpler
debugging when something goes wrong.
- Enables rapid feedback. Frequent releases get features and fixes in front of
users sooner. This shortens the feedback loop and allows the team to course-correct
before investing heavily in the wrong direction.
- Exercises recovery capability. Teams that deploy frequently practice the
deployment process daily. When a production incident occurs, the deployment
process is well-rehearsed and reliable, directly improving
Mean Time to Repair.
- Decouples deploy from release. At high frequency, teams separate the act of
deploying code from the act of enabling features for users. Feature flags,
progressive delivery, and dark launches become standard practice.
To improve Release Frequency:
- Reduce Development Cycle Time by decomposing work
into smaller increments.
- Remove manual handoffs to other teams (e.g., ops, QA, change management).
- Automate every step of the deployment process, from build through production
verification.
- Replace manual change approval boards with automated policy checks and peer
review.
- Convert hard dependencies on other teams or services into soft dependencies using
feature flags and service virtualization.
- Adopt Trunk-Based Development so that
trunk is always in a deployable state.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.6.8 - Work in Progress
Number of work items started but not yet completed – a leading indicator of flow problems, context switching, and delivery delays.
Definition
Work in Progress (WIP) is the total count of work items that have been started but
not yet completed and delivered to production. This includes all types of work:
stories, defects, tasks, spikes, and any other items that a team member has begun
but not finished.
WIP is a leading indicator from Lean manufacturing. Unlike trailing metrics such as
Development Cycle Time or
Lead Time, WIP tells you about problems that are happening right
now. High WIP predicts future delivery delays, increased cycle time, and lower
quality.
Little’s Law provides the mathematical relationship: average cycle time = WIP ÷ throughput.
If throughput (the rate at which items are completed) stays constant, increasing WIP
directly increases cycle time. For example, a team finishing two items per day with ten
items in progress has an average cycle time of five days; cutting WIP to four brings it
down to two days. The only way to reduce cycle time without working faster is to reduce WIP.
How to Measure
- Count all in-progress items. At a regular cadence (daily or at each standup),
count the number of items in any active state on your team’s board. Include
everything between “To Do” and “Done.”
- Normalize by team size. Divide WIP by the number of team members to get a
per-person ratio. This makes the metric comparable across teams of different sizes.
- Track over time. Record the WIP count daily and observe trends. A rising WIP
count is an early warning of delivery problems.
Data sources:
- Kanban boards – Jira, Azure Boards, Trello, GitHub Projects, or physical
boards. Count cards in any column between the backlog and done.
- Issue trackers – Query for items with an “In Progress,” “In Review,”
“In QA,” or equivalent active status.
- Manual count – At standup, ask: “How many things are we actively working on
right now?”
The simplest and most effective approach is to make WIP visible by keeping the team
board up to date and counting active items daily.
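A minimal sketch of that daily count, combined with the Little's Law relationship above (the board statuses, team size, and throughput figure are illustrative):

```javascript
// Daily WIP check: count active items, normalize by team size, and use
// Little's Law to forecast cycle time.
const TEAM_SIZE = 5;

function wipReport(boardItems, throughputPerDay) {
  const active = boardItems.filter(
    (item) => item.status !== "To Do" && item.status !== "Done"
  );
  const wip = active.length;
  return {
    wip,
    wipPerPerson: wip / TEAM_SIZE,
    // Little's Law: average cycle time = WIP / throughput
    forecastCycleTimeDays: wip / throughputPerDay,
  };
}

console.log(
  wipReport(
    [
      { id: "ABC-101", status: "In Progress" },
      { id: "ABC-102", status: "In Review" },
      { id: "ABC-103", status: "To Do" },
    ],
    1.5 // items completed per day, taken from recent history
  )
); // { wip: 2, wipPerPerson: 0.4, forecastCycleTimeDays: 1.33... }
```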
Targets
| Level | WIP per Team |
|---|---|
| Low | More than 2x team size |
| Medium | Between 1x and 2x team size |
| High | Equal to team size |
| Elite | Less than team size (ideally half) |
The guiding principle is that WIP should never exceed team size. A team of five
should have at most five items in progress at any time. Elite teams often work
in pairs, bringing WIP to roughly half the team size.
Common Pitfalls
- Hiding work. Not moving items to “In Progress” when working on them keeps
WIP artificially low. The board must reflect reality. If someone is working on
it, it should be visible.
- Marking items done prematurely. Moving items to “Done” before they are
deployed to production understates WIP. The Definition of Done must include
production deployment.
- Creating micro-tasks. Splitting a single story into many small tasks
(development, testing, code review, deployment) and tracking each separately
inflates the item count without changing the actual work. Measure WIP at the
story or feature level.
- Ignoring unplanned work. Production support, urgent requests, and
interruptions consume capacity but are often not tracked on the board. If the
team is spending time on it, it is WIP and should be visible.
- Setting WIP limits but not enforcing them. WIP limits only work if the team
actually stops starting new work when the limit is reached. Treat WIP limits as
a hard constraint, not a suggestion.
Connection to CD
WIP is the most actionable flow metric and directly impacts every aspect of
Continuous Delivery:
- Predicts cycle time. Per Little’s Law, WIP and cycle time are directly
proportional. Reducing WIP is the fastest way to reduce
Development Cycle Time without changing anything
else about the delivery process.
- Reduces context switching. When developers juggle multiple items, they lose
time switching between contexts. Research consistently shows that each additional
item in progress reduces effective productivity. Low WIP means more focus and
faster completion.
- Exposes blockers. When WIP limits are in place and an item gets blocked, the
team cannot simply start something new. They must resolve the blocker first. This
forces the team to address systemic problems rather than working around them.
- Enables continuous flow. CD depends on a steady flow of small changes moving
through the pipeline. High WIP creates irregular, bursty delivery. Low WIP
creates smooth, predictable flow.
- Improves quality. When teams focus on fewer items, each item gets more
attention. Code reviews happen faster, testing is more thorough, and defects are
caught sooner. This naturally reduces Change Fail Rate.
- Supports trunk-based development. High WIP often correlates with many
long-lived branches. Reducing WIP encourages developers to complete and integrate
work before starting something new, which aligns with
Integration Frequency goals.
To reduce WIP:
- Set explicit WIP limits for the team and enforce them. Start with a limit equal
to team size and reduce it over time.
- Prioritize finishing work over starting new work. At standup, ask “What can I
help finish?” before “What should I start?”
- Prioritize code review and pairing to unblock teammates over picking up new items.
- Make the board visible and accurate. Use it as the single source of truth for
what the team is working on.
- Identify and address recurring blockers that cause items to stall in progress.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.7 - Testing
Testing types, patterns, and best practices for building confidence in your delivery pipeline.
A reliable test suite is essential for continuous delivery. These pages cover the
different types of tests, when to use each, and best practices for test architecture.
Test Types
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.7.1 - Unit Tests
Fast, deterministic tests that verify individual functions, methods, or components in isolation with test doubles for dependencies.
Definition
A unit test is a deterministic test that exercises a discrete unit of the application – such as
a function, method, or UI component – in isolation to determine whether it behaves as expected.
All external dependencies are replaced with test doubles so the test runs
quickly and produces the same result every time.
When testing the behavior of functions, prefer testing public APIs (methods, interfaces,
exported functions) over private internals. Testing private implementation details creates
change-detector tests that break during routine refactoring without adding safety.
The purpose of unit tests is to:
- Verify the functionality of a single unit (method, class, function) in isolation.
- Cover high-complexity logic where many input permutations exist, such as business rules, calculations, and state transitions.
- Keep cyclomatic complexity visible and manageable through good separation of concerns.
When to Use
- During development – run the relevant subset of unit tests continuously while writing
code. TDD (Red-Green-Refactor) is the most effective workflow.
- On every commit – use pre-commit hooks or watch-mode test runners so broken tests never
reach the remote repository.
- In CI – execute the full unit test suite on every pull request and on the trunk after
merge to verify nothing was missed locally.
Unit tests are the right choice when the behavior under test can be exercised without network
access, file system access, or database connections. If you need any of those, you likely need
an integration test or a functional test instead.
Characteristics
| Property | Value |
| --- | --- |
| Speed | Milliseconds per test |
| Determinism | Always deterministic |
| Scope | Single function, method, or component |
| Dependencies | All replaced with test doubles |
| Network | None |
| Database | None |
| Breaks build | Yes |
Examples
A JavaScript unit test verifying a pure utility function:
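A minimal sketch using Jest; the `slugify` utility and its module path are illustrative stand-ins, not from the original:

```javascript
// slugify.test.js - unit test for a pure utility function (hypothetical example)
const { slugify } = require('./slugify');

describe('slugify', () => {
  it('lowercases and hyphenates words', () => {
    expect(slugify('Continuous Delivery')).toBe('continuous-delivery');
  });

  it('drops characters that are not alphanumeric', () => {
    expect(slugify('Ship it! (today)')).toBe('ship-it-today');
  });

  it('returns an empty string for empty input', () => {
    expect(slugify('')).toBe('');
  });
});
```

No test doubles are needed here because the function is pure: no I/O, no shared state, and the same output for the same input.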
A Java unit test using Mockito to isolate the system under test:
Anti-Patterns
- Testing private methods – private implementations are meant to change. Test the public
interface that calls them instead.
- No assertions – a test that runs code without asserting anything provides false
confidence. Lint rules like
jest/expect-expect can catch this.
- Disabling or skipping tests – skipped tests erode confidence over time. Fix or remove
them.
- Testing implementation details – asserting on internal state or call order rather than
observable output creates brittle tests that break during refactoring.
- Ice cream cone testing – relying primarily on slow E2E tests while neglecting fast unit
tests inverts the test pyramid and slows feedback.
- Chasing coverage numbers – gaming coverage metrics (e.g., running code paths without
meaningful assertions) creates a false sense of confidence. Focus on use-case coverage
instead.
Connection to CD Pipeline
Unit tests occupy the base of the test pyramid. They run in the earliest stages of the
CI/CD pipeline and provide the fastest feedback loop:
- Local development – watch mode reruns tests on every save.
- Pre-commit – hooks run the suite before code reaches version control.
- PR verification – CI runs the full suite and blocks merge on failure.
- Trunk verification – CI reruns tests on the merged HEAD to catch integration issues.
Because unit tests are fast and deterministic, they should always break the build on failure.
A healthy CD pipeline depends on a large, reliable unit test suite that gives developers
confidence to ship small changes frequently.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.7.2 - Integration Tests
Deterministic tests that verify how units interact together or with external system boundaries using test doubles for non-deterministic dependencies.
Definition
An integration test is a deterministic test that verifies how the unit under test interacts
with other units without directly accessing external sub-systems. It may validate multiple
units working together (sometimes called a “sociable unit test”) or the portion of the code
that interfaces with an external network dependency while using a test double to represent
that dependency.
For clarity: an “integration test” is not a test that broadly integrates multiple
sub-systems. That is an end-to-end test.
When to Use
Integration tests provide the best balance of speed, confidence, and cost. Use them when:
- You need to verify that multiple units collaborate correctly – for example, a service
calling a repository that calls a data mapper.
- You need to validate the interface layer to an external system (HTTP client, message
producer, database query) while keeping the external system replaced by a test double.
- You want to confirm that a refactoring did not break behavior. Integration tests that
avoid testing implementation details survive refactors without modification.
- You are building a front-end component that composes child components and needs to verify
the assembled behavior from the user’s perspective.
If the test requires a live network call to a system outside localhost, it is either a
contract test or an E2E test.
Characteristics
| Property | Value |
| --- | --- |
| Speed | Milliseconds to low seconds |
| Determinism | Always deterministic |
| Scope | Multiple units or a unit plus its boundary |
| Dependencies | External systems replaced with test doubles |
| Network | Localhost only |
| Database | Localhost / in-memory only |
| Breaks build | Yes |
Examples
A JavaScript integration test verifying that a connector returns structured data:
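A minimal sketch, assuming a hypothetical `UserConnector` that receives its HTTP client as a constructor argument so the real client can be swapped for a Jest test double:

```javascript
// userConnector.test.js - integration test with the HTTP boundary replaced (hypothetical example)
const { UserConnector } = require('./userConnector');

describe('UserConnector', () => {
  it('maps the raw API payload into structured user data', async () => {
    // Test double standing in for the real HTTP client; no network access.
    const fakeHttpClient = {
      get: jest.fn().mockResolvedValue({
        status: 200,
        data: { id: '42', first_name: 'Ada', last_name: 'Lovelace' },
      }),
    };

    const connector = new UserConnector(fakeHttpClient);
    const user = await connector.fetchUser('42');

    expect(fakeHttpClient.get).toHaveBeenCalledWith('/users/42');
    expect(user).toEqual({ id: '42', fullName: 'Ada Lovelace' });
  });
});
```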
Subcategories
Service integration tests – Validate how the system under test responds to information
from an external service. Use virtual services or static mocks; pair with
contract tests to keep the doubles current.
Database integration tests – Validate query logic against a controlled data store. Prefer
in-memory databases, isolated DB instances, or personalized datasets over shared live data.
Front-end integration tests – Render the component tree and interact with it the way a
user would. Follow the accessibility order of operations for element selection: visible text
and labels first, ARIA roles second, test IDs only as a last resort.
Anti-Patterns
- Peeking behind the curtain – using tools that expose component internals (e.g.,
Enzyme’s
instance() or state()) instead of testing from the user’s perspective.
- Mocking too aggressively – replacing every collaborator turns an integration test into a
unit test and removes the value of testing real interactions. Only mock what is necessary to
maintain determinism.
- Testing implementation details – asserting on internal state, private methods, or call
counts rather than observable output.
- Introducing a test user – creating an artificial actor that would never exist in
production. Write tests from the perspective of a real end-user or API consumer.
- Tolerating flaky tests – non-deterministic integration tests erode trust. Fix or remove
them immediately.
- Duplicating E2E scope – if the test integrates multiple deployed sub-systems with live
network calls, it belongs in the E2E category, not here.
Connection to CD Pipeline
Integration tests form the largest portion of a healthy test suite (the “trophy” or the
middle of the pyramid). They run alongside unit tests in the earliest CI stages:
- Local development – run in watch mode or before committing.
- PR verification – CI executes the full suite; failures block merge.
- Trunk verification – CI reruns on the merged HEAD.
Because they are deterministic and fast, integration tests should always break the build.
A team whose refactors break many tests likely has too few integration tests and too many
fine-grained unit tests. As Kent C. Dodds advises: “Write tests, not too many, mostly
integration.”
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.7.3 - Functional Tests
Deterministic tests that verify all modules of a sub-system work together from the actor’s perspective, using test doubles for external dependencies.
Definition
A functional test is a deterministic test that verifies all modules of a sub-system are
working together. It introduces an actor – typically a user interacting with the UI or a
consumer calling an API – and validates the ingress and egress of that actor within the
system boundary. External sub-systems are replaced with test doubles to
keep the test deterministic.
Functional tests cover broad-spectrum behavior: UI interactions, presentation logic, and
business logic flowing through the full sub-system. They differ from
end-to-end tests in that side effects are mocked and never cross boundaries
outside the system’s control.
Functional tests are sometimes called component tests.
When to Use
- You need to verify a complete user-facing feature from input to output within a single
deployable unit (e.g., a service or a front-end application).
- You want to test how the UI, business logic, and data layers interact without depending
on live external services.
- You need to simulate realistic user workflows – filling in forms, navigating pages,
submitting API requests – while keeping the test fast and repeatable.
- You are validating acceptance criteria for a user story and want a test that maps
directly to the specified behavior.
If the test needs to reach a live external dependency, it is an E2E test. If it
tests a single unit in isolation, it is a unit test.
Characteristics
| Property | Value |
| --- | --- |
| Speed | Seconds (slower than unit, faster than E2E) |
| Determinism | Always deterministic |
| Scope | All modules within a single sub-system |
| Dependencies | External systems replaced with test doubles |
| Network | Localhost only |
| Database | Localhost / in-memory only |
| Breaks build | Yes |
Examples
A functional test for a REST API using an in-process server and mocked downstream services:
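A sketch using supertest against an in-process Express app, assuming a hypothetical `createApp` factory that accepts its downstream payment gateway as a dependency:

```javascript
// orders.functional.test.js - exercises the HTTP API in-process (hypothetical example)
const request = require('supertest');
const { createApp } = require('../src/app'); // hypothetical app factory

describe('POST /orders', () => {
  it('creates an order and returns its id', async () => {
    // The external payment gateway is replaced with a test double.
    const fakePaymentGateway = {
      charge: jest.fn().mockResolvedValue({ approved: true, reference: 'ref-123' }),
    };
    const app = createApp({ paymentGateway: fakePaymentGateway });

    const response = await request(app)
      .post('/orders')
      .send({ sku: 'book-1', quantity: 2 })
      .expect(201);

    expect(response.body.orderId).toBeDefined();
    expect(fakePaymentGateway.charge).toHaveBeenCalledTimes(1);
  });
});
```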
A front-end functional test exercising a login flow with a mocked auth service:
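A sketch assuming a React `LoginPage` component, React Testing Library, and a Jest module mock for the auth client; all names and paths are illustrative:

```javascript
// loginPage.functional.test.jsx - drives the login flow the way a user would (hypothetical example)
import React from 'react';
import '@testing-library/jest-dom';
import { render, screen } from '@testing-library/react';
import userEvent from '@testing-library/user-event';
import { LoginPage } from '../src/LoginPage';
import { login } from '../src/authClient';

// Replace the real auth service call with a test double.
jest.mock('../src/authClient');

test('shows a welcome message after a successful login', async () => {
  login.mockResolvedValue({ token: 'fake-token', name: 'Ada' });
  const user = userEvent.setup();

  render(<LoginPage />);

  await user.type(screen.getByLabelText(/email/i), 'ada@example.com');
  await user.type(screen.getByLabelText(/password/i), 'correct-horse');
  await user.click(screen.getByRole('button', { name: /sign in/i }));

  expect(await screen.findByText(/welcome, ada/i)).toBeInTheDocument();
  expect(login).toHaveBeenCalledWith('ada@example.com', 'correct-horse');
});
```

Element selection follows the accessibility-first order described under integration tests: visible labels and roles first, test IDs only as a last resort.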
Anti-Patterns
- Using live external services – this makes the test non-deterministic and slow. Use test
doubles for anything outside the sub-system boundary.
- Testing through the database – sharing a live database between tests introduces ordering
dependencies and flakiness. Use in-memory databases or mocked data layers.
- Ignoring the actor’s perspective – functional tests should interact with the system the
way a user or consumer would. Reaching into internal APIs or bypassing the UI defeats the
purpose.
- Duplicating unit test coverage – functional tests should focus on feature-level behavior
and happy/critical paths, not every edge case. Leave permutation testing to unit tests.
- Slow test setup – if spinning up the sub-system takes too long, invest in faster
bootstrapping (in-memory stores, lazy initialization) rather than skipping functional tests.
Connection to CD Pipeline
Functional tests run after unit and integration tests in the pipeline, typically as part of
the same CI stage:
- PR verification – functional tests run against the sub-system in isolation, giving
confidence that the feature works before merge.
- Trunk verification – the same tests run on the merged HEAD to catch conflicts.
- Pre-deployment gate – functional tests can serve as the final deterministic gate before
a build artifact is promoted to a staging environment.
Because functional tests are deterministic, they should break the build on failure.
They are more expensive than unit and integration tests, so teams should focus on
happy-path and critical-path scenarios while keeping the total count manageable.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.7.4 - End-to-End Tests
Non-deterministic tests that validate the entire software system along with its integration with external interfaces and production-like scenarios.
Definition
End-to-end (E2E) tests validate the entire software system, including its integration with
external interfaces. They exercise complete production-like scenarios using real (or
production-like) data and environments to simulate real-time settings. No test doubles are
used – the test hits live services, databases, and third-party integrations just as a real
user would.
Because they depend on external systems, E2E tests are typically non-deterministic: they
can fail for reasons unrelated to code correctness, such as network instability or
third-party outages.
When to Use
E2E tests should be the least-used test type due to their high cost in execution time and
maintenance. Use them for:
- Happy-path validation of critical business flows (e.g., user signup, checkout, payment
processing).
- Smoke testing a deployed environment to verify that key integrations are functioning.
- Cross-team workflows that span multiple sub-systems and cannot be tested any other way.
Do not use E2E tests to cover edge cases, error handling, or input validation – those
scenarios belong in unit, integration, or
functional tests.
Vertical vs. Horizontal E2E Tests
Vertical E2E tests target features under the control of a single team:
- Favoriting an item and verifying it persists across refresh.
- Creating a saved list and adding items to it.
Horizontal E2E tests span multiple teams:
- Navigating from the homepage through search, item detail, cart, and checkout.
Horizontal tests are significantly more complex and fragile. Due to their large failure
surface area, they are not suitable for blocking release pipelines.
Characteristics
| Property | Value |
| --- | --- |
| Speed | Seconds to minutes per test |
| Determinism | Typically non-deterministic |
| Scope | Full system including external integrations |
| Dependencies | Real services, databases, third-party APIs |
| Network | Full network access |
| Database | Live databases |
| Breaks build | Generally no (see guidance below) |
Examples
A vertical E2E test verifying user lookup through a live web interface:
A browser-based E2E test using a tool like Playwright:
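A minimal Playwright sketch that doubles as the user-lookup case above; the staging URL, route, labels, and seeded test user are placeholders:

```javascript
// user-lookup.e2e.spec.js - runs against a deployed environment, no test doubles (hypothetical example)
const { test, expect } = require('@playwright/test');

test('an agent can look up an existing user by email', async ({ page }) => {
  // Placeholder staging URL and seeded test data.
  await page.goto('https://staging.example.com/admin/users');

  await page.getByLabel('Search by email').fill('seeded.user@example.com');
  await page.getByRole('button', { name: 'Search' }).click();

  // This assertion depends on live services and seeded data, so the test is
  // non-deterministic and should not gate the build on its own.
  await expect(page.getByRole('row', { name: /seeded\.user@example\.com/ })).toBeVisible();
});
```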
Anti-Patterns
- Using E2E tests as the primary safety net – this is the “ice cream cone” anti-pattern.
E2E tests are slow and fragile; the majority of your confidence should come from unit and
integration tests.
- Blocking the pipeline with horizontal E2E tests – these tests span too many teams and
failure surfaces. Run them asynchronously and review failures out of band.
- Ignoring flaky failures – E2E tests often fail for environmental reasons. Track the
frequency and root cause of failures. If a test is not providing signal, fix it or remove
it.
- Testing edge cases in E2E – exhaustive input validation and error-path testing should
happen in cheaper, faster test types.
- Not capturing failure context – E2E failures are expensive to debug. Capture
screenshots, network logs, and video recordings automatically on failure.
Connection to CD Pipeline
E2E tests run in the later stages of the delivery pipeline, after the build artifact has
passed all deterministic tests and has been deployed to a staging or pre-production
environment:
- Post-deployment smoke tests – a small, fast suite of vertical E2E tests verifies that
the deployment succeeded and critical paths work.
- Scheduled regression suites – broader E2E suites (including horizontal tests) run on a
schedule rather than on every commit.
- Production monitoring – customer experience alarms (synthetic monitoring) are a form of
continuous E2E testing that runs in production.
Because E2E tests are non-deterministic, they should not break the build in most cases. A
team may choose to gate on a small set of highly reliable vertical E2E tests, but must invest
in reducing false positives to make this valuable. CD pipelines should be optimized for rapid
recovery of production issues rather than attempting to prevent all defects with slow,
fragile E2E gates.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.7.5 - Contract Tests
Non-deterministic tests that validate test doubles by verifying API contract format against live external systems.
Definition
A contract test validates that the test doubles used in
integration tests still accurately represent the real external system.
Contract tests run against the live external sub-system and exercise the portion of the
code that interfaces with it. Because they depend on live services, contract tests are
non-deterministic and should not break the build. Instead, failures should trigger a
review to determine whether the contract has changed and the test doubles need updating.
A contract test validates contract format, not specific data. It verifies that response
structures, field names, types, and status codes match expectations – not that particular
values are returned.
Contract tests have two perspectives:
- Provider – the team that owns the API verifies that all changes are backwards compatible
(unless a new API version is introduced). Every build should validate the provider contract.
- Consumer – the team that depends on the API verifies that they can still consume the
properties they need, following
Postel’s Law: “Be conservative in
what you do, be liberal in what you accept from others.”
When to Use
- You have integration tests that use test doubles (mocks, stubs, recorded
responses) to represent external services, and you need assurance those doubles remain
accurate.
- You consume a third-party or cross-team API that may change without notice.
- You provide an API to other teams and want to ensure that your changes do not break their
expectations (consumer-driven contracts).
- You are adopting contract-driven development, where contracts are defined during design
so that provider and consumer teams can work in parallel using shared mocks and fakes.
Characteristics
| Property | Value |
| --- | --- |
| Speed | Seconds (depends on network latency) |
| Determinism | Non-deterministic (hits live services) |
| Scope | Interface boundary between two systems |
| Dependencies | Live external sub-system |
| Network | Yes – calls the real dependency |
| Database | Depends on the provider |
| Breaks build | No – failures trigger review, not build failure |
Examples
A provider contract test verifying that an API response matches the expected schema:
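A sketch of the provider-side check using Jest and the built-in fetch (Node 18+); it asserts on field presence, types, and status code, never on specific values, and the URL is a placeholder:

```javascript
// users-contract.test.js - validates response shape, not data values (hypothetical example)
describe('GET /users/:id contract', () => {
  it('returns the agreed structure', async () => {
    // Live call to the running provider; failures trigger review rather than
    // breaking the build.
    const response = await fetch('https://staging.example.com/api/users/1');
    const body = await response.json();

    expect(response.status).toBe(200);
    expect(typeof body.id).toBe('string');
    expect(typeof body.name).toBe('string');
    expect(typeof body.email).toBe('string');
    // New optional fields are fine; missing or re-typed fields are a contract break.
  });
});
```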
A consumer-driven contract test using Pact:
Anti-Patterns
- Using contract tests to validate business logic – contract tests verify structure and
format, not behavior. Business logic belongs in functional tests.
- Breaking the build on contract test failure – because these tests hit live systems,
failures may be caused by network issues or temporary outages, not actual contract changes.
Treat failures as signals to investigate.
- Neglecting to update test doubles – when a contract test fails because the upstream API
changed, the test doubles in your integration tests must be updated to match. Ignoring
failures defeats the purpose.
- Running contract tests too infrequently – the frequency should be proportional to the
volatility of the interface. Highly active APIs need more frequent contract validation.
- Testing specific data values – asserting that
name equals "Alice" makes the test
brittle. Assert on types, required fields, and response codes instead.
Connection to CD Pipeline
Contract tests run asynchronously from the main CI build, typically on a schedule:
- Provider side – provider contract tests (schema validation, response code checks) are
often implemented as deterministic unit tests and run on every commit as part of the
provider’s CI pipeline.
- Consumer side – consumer contract tests run on a schedule (e.g., hourly or daily)
against the live provider. Failures are reviewed and may trigger updates to test doubles
or conversations between teams.
- Consumer-driven contracts – when using tools like Pact, the consumer publishes
contract expectations and the provider runs them continuously. Both teams communicate when
contracts break.
Contract tests are the bridge that keeps your fast, deterministic integration test suite
honest. Without them, test doubles can silently drift from reality, and your integration
tests provide false confidence.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.7.6 - Static Analysis
Code analysis tools that evaluate non-running code for security vulnerabilities, complexity, and best practice violations.
Definition
Static analysis (also called static testing) evaluates non-running code against rules for
known good practices. Unlike other test types that execute code and observe behavior, static
analysis inspects source code, configuration files, and dependency manifests to detect
problems before the code ever runs.
Static analysis serves several key purposes:
- Catches errors that would otherwise surface at runtime.
- Warns of excessive complexity that degrades the ability to change code safely.
- Identifies security vulnerabilities and coding patterns that provide attack vectors.
- Enforces coding standards by removing subjective style debates from code reviews.
- Alerts to dependency issues – outdated packages, known CVEs, license incompatibilities,
or supply-chain compromises.
When to Use
Static analysis should run continuously, at every stage where feedback is possible:
- In the IDE – real-time feedback as developers type, via editor plugins and language
server integrations.
- On save – format-on-save and lint-on-save catch issues immediately.
- Pre-commit – hooks prevent problematic code from entering version control.
- In CI – the full suite of static checks runs on every PR and on the trunk after merge,
verifying that earlier local checks were not bypassed.
Static analysis is always applicable. Every project, regardless of language or platform,
benefits from linting, formatting, and dependency scanning.
Characteristics
| Property | Value |
| --- | --- |
| Speed | Seconds (typically the fastest test category) |
| Determinism | Always deterministic |
| Scope | Entire codebase (source, config, dependencies) |
| Dependencies | None (analyzes code at rest) |
| Network | None (except dependency scanners) |
| Database | None |
| Breaks build | Yes |
Examples
Linting
A .eslintrc.json configuration enforcing test quality rules:
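One possible configuration, assuming eslint-plugin-jest is installed; the rule names are real, the severities and thresholds are a matter of team preference:

```json
{
  "extends": ["eslint:recommended", "plugin:jest/recommended"],
  "plugins": ["jest"],
  "rules": {
    "jest/expect-expect": "error",
    "jest/no-disabled-tests": "error",
    "jest/no-focused-tests": "error",
    "complexity": ["error", 10],
    "max-depth": ["error", 3]
  }
}
```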
Type Checking
TypeScript catches type mismatches at compile time, eliminating entire classes of runtime
errors:
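A small illustration with a hypothetical pricing function; `tsc` rejects the second call before any code runs:

```typescript
// pricing.ts - the compiler catches the mismatch without executing anything
function applyDiscount(price: number, percent: number): number {
  return price - price * (percent / 100);
}

applyDiscount(100, 10);    // OK
applyDiscount(100, '10%'); // Compile error: Argument of type 'string' is not
                           // assignable to parameter of type 'number'.
```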
Dependency Scanning
Tools like npm audit, Snyk, or Dependabot scan for known vulnerabilities:
Types of Static Analysis
| Type | Purpose |
| --- | --- |
| Linting | Catches common errors and enforces best practices |
| Formatting | Enforces consistent code style, removing subjective debates |
| Complexity analysis | Flags overly deep or long code blocks that breed defects |
| Type checking | Prevents type-related bugs, replacing some unit tests |
| Security scanning | Detects known vulnerabilities and dangerous coding patterns |
| Dependency scanning | Checks for outdated, hijacked, or insecurely licensed deps |
Anti-Patterns
- Disabling rules instead of fixing code – suppressing linter warnings or ignoring
security findings erodes the value of static analysis over time.
- Not customizing rules – default rulesets are a starting point. Write custom rules for
patterns that come up repeatedly in code reviews.
- Running static analysis only in CI – by the time CI reports a formatting error, the
developer has context-switched. IDE plugins and pre-commit hooks provide immediate feedback.
- Ignoring dependency vulnerabilities – known CVEs in dependencies are a direct attack
vector. Treat high-severity findings as build-breaking.
- Treating static analysis as optional – static checks should be mandatory and enforced.
If developers can bypass them, they will.
Connection to CD Pipeline
Static analysis is the first gate in the CD pipeline, providing the fastest feedback:
- IDE / local development – plugins run in real time as code is written.
- Pre-commit – hooks run linters and formatters, blocking commits that violate rules.
- PR verification – CI runs the full static analysis suite (linting, type checking,
security scanning, dependency auditing) and blocks merge on failure.
- Trunk verification – the same checks re-run on the merged HEAD to catch anything
missed.
- Scheduled scans – dependency and security scanners run on a schedule to catch newly
disclosed vulnerabilities in existing dependencies.
Because static analysis requires no running code, no test environment, and no external
dependencies, it is the cheapest and fastest form of quality verification. A mature CD
pipeline treats static analysis failures the same as test failures: they break the build.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.
8.7.7 - Test Doubles
Patterns for isolating dependencies in tests: stubs, mocks, fakes, spies, and dummies.
Definition
Test doubles are stand-in objects that replace real production dependencies during testing.
The term comes from the film industry’s “stunt double” – just as a stunt double replaces an
actor for dangerous scenes, a test double replaces a costly or non-deterministic dependency
to make tests fast, isolated, and reliable.
Test doubles allow you to:
- Remove non-determinism by replacing network calls, databases, and file systems with
predictable substitutes.
- Control test conditions by forcing specific states, error conditions, or edge cases that
would be difficult to reproduce with real dependencies.
- Increase speed by eliminating slow I/O operations.
- Isolate the system under test so that failures point directly to the code being tested,
not to an external dependency.
Types of Test Doubles
| Type | Description | Example Use Case |
| --- | --- | --- |
| Dummy | Passed around but never actually used. Fills parameter lists. | A required logger parameter in a constructor. |
| Stub | Provides canned answers to calls made during the test. Does not respond to anything outside what is programmed. | Returning a fixed user object from a repository. |
| Spy | A stub that also records information about how it was called (arguments, call count, order). | Verifying that an analytics event was sent once. |
| Mock | Pre-programmed with expectations about which calls will be made. Verification happens on the mock itself. | Asserting that sendEmail() was called with specific arguments. |
| Fake | Has a working implementation, but takes shortcuts not suitable for production. | An in-memory database replacing PostgreSQL. |
Choosing the Right Double
- Use stubs when you need to supply data but do not care how it was requested.
- Use spies when you need to verify call arguments or call count.
- Use mocks when the interaction itself is the primary thing being verified.
- Use fakes when you need realistic behavior but cannot use the real system.
- Use dummies when a parameter is required by the interface but irrelevant to the test.
When to Use
Test doubles are used in every layer of deterministic testing:
- Unit tests – nearly all dependencies are replaced with test doubles to
achieve full isolation.
- Integration tests – external sub-systems (APIs, databases, message
queues) are replaced, but internal collaborators remain real.
- Functional tests – dependencies that cross the sub-system boundary
are replaced to maintain determinism.
Test doubles should be used less in later pipeline stages.
End-to-end tests use no test doubles by design.
Examples
A JavaScript stub providing a canned response:
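A minimal sketch, assuming a hypothetical `getWeatherSummary` function that takes its weather API client as an argument:

```javascript
// weather.test.js - a stub supplies canned data, no network involved (hypothetical example)
const { getWeatherSummary } = require('./weather'); // hypothetical function under test

test('summarizes the forecast for display', async () => {
  // Stub: canned answer only; it does not react to anything beyond what is programmed.
  const stubWeatherClient = {
    fetchForecast: async () => ({ tempC: 21, condition: 'sunny' }),
  };

  const summary = await getWeatherSummary(stubWeatherClient, 'Amsterdam');

  expect(summary).toBe('Sunny, 21°C');
});
```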
A Java spy verifying interaction:
A fake in-memory repository:
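A sketch of a fake backed by a Map; the repository interface and the `RegistrationService` it supports are hypothetical:

```javascript
// inMemoryUserRepository.test.js - a fake has a real, if non-production, implementation (hypothetical example)
const { RegistrationService } = require('./registrationService'); // hypothetical system under test

class InMemoryUserRepository {
  constructor() {
    this.users = new Map();
  }

  async save(user) {
    this.users.set(user.id, { ...user });
    return user;
  }

  async findById(id) {
    return this.users.get(id) ?? null;
  }
}

test('registering a user makes them retrievable', async () => {
  const repository = new InMemoryUserRepository();
  const service = new RegistrationService(repository);

  await service.register({ id: 'u1', email: 'ada@example.com' });

  expect(await repository.findById('u1')).toMatchObject({ email: 'ada@example.com' });
});
```

Unlike a stub, the fake honors the same read-after-write behavior the real repository would, which makes it useful when a test needs realistic state without a live database.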
Anti-Patterns
- Mocking what you do not own – wrapping a third-party API in a thin adapter and mocking
the adapter is safer than mocking the third-party API directly. Direct mocks couple your
tests to the library’s implementation.
- Over-mocking – replacing every collaborator with a mock turns the test into a mirror of
the implementation. Tests become brittle and break on every refactor. Only mock what is
necessary to maintain determinism.
- Not validating test doubles – if the real dependency changes its contract, your test
doubles silently drift. Use contract tests to keep doubles honest.
- Complex mock setup – if setting up mocks requires dozens of lines, the system under test
may have too many dependencies. Consider refactoring the production code rather than adding
more mocks.
- Using mocks to test implementation details – asserting on the exact sequence and count
of internal method calls creates change-detector tests. Prefer asserting on observable
output.
Connection to CD Pipeline
Test doubles are a foundational technique that enables the fast, deterministic tests required
for continuous delivery:
- Early pipeline stages (static analysis, unit tests, integration tests) rely heavily on
test doubles to stay fast and deterministic. This is where the majority of defects are
caught.
- Later pipeline stages (E2E tests, production monitoring) use fewer or no test doubles,
trading speed for realism.
- Contract tests run asynchronously to validate that test doubles still match reality,
closing the gap between the deterministic and non-deterministic stages of the pipeline.
The guiding principle from Justin Searls applies: “Don’t poke too many holes in reality.”
Use test doubles when you must, but prefer real implementations when they are fast and
deterministic.
This content is adapted from the Dojo Consortium,
licensed under CC BY 4.0.