Quality and Delivery Anti-Patterns
Start here. Find the anti-patterns your team is facing and learn the path to solving them.
Every team migrating to continuous delivery faces obstacles. Most are not unique to your team,
your technology, or your industry. This section catalogs the anti-patterns that hurt quality,
increase rework, and make delivery timelines unpredictable - then provides a concrete path to
fix each one.
Start with the problem you feel most. Each page links to the practices and migration phases
that address it.
Anti-pattern index
Sorted by quality impact so you can prioritize what to fix first.
1 - Team Workflow
Anti-patterns in how teams assign, coordinate, and manage the flow of work.
These anti-patterns affect how work moves through the team. They create bottlenecks, hide
problems, and prevent the steady flow of small changes that continuous delivery requires.
1.1 - Pull Request Review Bottlenecks
Pull requests sit for days waiting for review. Reviews happen in large batches. Authors have moved on by the time feedback arrives.
Category: Team Workflow | Quality Impact: High
What This Looks Like
A developer opens a pull request and waits. Hours pass. A day passes. They ping someone in chat.
The reviewer is busy with their own work. Eventually, late in the afternoon or the next morning,
comments arrive. The author has moved on to something else and has to reload context to respond.
Another round of comments. Another wait. The PR finally merges two or three days after it was
opened.
Common variations:
- The aging PR queue. The team has five or more open PRs at any given time. Some are days old.
Developers start new work while they wait, which creates more PRs, which creates more review
load, which slows reviews further.
- The designated reviewer. One or two senior developers review everything. They are
overwhelmed. Their review queue is a bottleneck that the rest of the team works around by
starting more work while they wait.
- The drive-by approval. Reviews are so slow that the team starts rubber-stamping PRs to
unblock each other. The review step exists in name only. Quality drops, but at least things
merge.
- The nitpick spiral. Reviewers leave dozens of style comments on formatting, naming, and
conventions that could be caught by a linter. Each round triggers another round. A 50-line
change accumulates 30 comments across three review cycles.
- The “I’ll get to it” pattern. When asked about a pending review, the answer is always “I’ll
look at it after I finish this.” But they never finish “this” because they have their own work,
and reviewing someone else’s code is never the priority.
The telltale sign: the team tracks PR age and the average is measured in days, not hours.
Why This Is a Problem
Slow code review is not just an inconvenience. It is a systemic bottleneck that undermines
continuous integration, inflates cycle time, and degrades the quality it is supposed to protect.
It blocks continuous integration
Trunk-based development requires integrating to trunk at least once per day. A PR that sits for
two days makes daily integration impossible. The branch diverges from trunk while it waits. Other
developers make changes to the same files. By the time the review is done, the PR has merge
conflicts that require additional work to resolve.
This is a compounding problem. Slow reviews cause longer-lived branches. Longer-lived branches
cause larger merge conflicts. Larger merge conflicts make integration painful. Painful integration
makes the team dread merging, which makes them delay opening PRs until the work is “complete,”
which makes PRs larger, which makes reviews take longer.
In teams where reviews complete within hours, branches rarely live longer than a day. Merge
conflicts are rare because changes are small and trunk has not moved far since the branch was
created.
It inflates cycle time
Every hour a PR waits for review is an hour added to cycle time. For a story that takes four hours
to code, a two-day review wait means the review step dominates the total cycle time. The coding
was fast. The pipeline is fast. But the work sits idle for days because a human has not looked at
it yet.
This wait time is pure waste. Nothing is happening to the code while it waits. No value is being
delivered. The change is done but not integrated, tested in the full pipeline, or deployed. It is
inventory sitting on the shelf.
When reviews happen within two hours, the review step nearly disappears from the cycle time
measurement. Code flows from development to trunk to production with minimal idle time.
It degrades the review quality it is supposed to protect
Slow reviews produce worse reviews, not better ones. When a reviewer sits down to review a PR that
was opened two days ago, they have no context on the author’s thinking. They review the code cold,
missing the intent behind decisions. They leave comments that the author already considered and
rejected, triggering unnecessary back-and-forth.
Large PRs make this worse. When a review has been delayed, the author often keeps working on the
same branch, adding more changes to avoid opening a second PR while the first one waits. What
started as a 50-line change becomes a 300-line change. Research consistently shows that reviewer
effectiveness drops sharply after 200 lines. Large PRs get superficial reviews - the reviewer
skims the diff, leaves a few surface-level comments, and approves because they do not have time
to review it thoroughly.
Fast reviews are better reviews. A reviewer who looks at a 50-line change within an hour of it
being opened has full context on what the team is working on, can ask the author questions in real
time, and can give focused attention to a small, digestible change.
It creates hidden WIP
Every open PR is work in progress. The code is written but not integrated. The developer who
authored it has moved on to something new, but their previous work is still “in progress” from the
team’s perspective. A team of five with eight open PRs has eight items of hidden WIP that do not
appear on the sprint board as “in progress” but consume the same attention.
This hidden WIP interacts badly with explicit WIP. A developer who has one story “in progress” on
the board but three PRs waiting for review is actually juggling four streams of work. Each PR that
gets comments requires a context switch back to code they wrote days ago. The cognitive overhead is
real even if the board does not show it.
Impact on continuous delivery
Continuous delivery requires that every change move from commit to production quickly and
predictably. Review bottlenecks create an unpredictable queue between “code complete” and
“integrated.” The queue length varies based on reviewer availability, competing priorities, and
team habits. Some PRs merge in hours, others take days. This variability makes delivery timelines
unpredictable and prevents the steady flow of small changes that CD depends on.
The bottleneck also discourages the small, frequent changes that make CD safe. Developers learn
that every PR costs a multi-day wait, so they batch changes into larger PRs to reduce the number
of times they pay that cost. Larger PRs are riskier, harder to review, and more likely to cause
problems - exactly the opposite of what CD requires.
How to Fix It
Step 1: Measure review turnaround time (Week 1)
You cannot fix what you do not measure. Start tracking two numbers:
- Time to first review: elapsed time from PR opened to first reviewer comment or approval.
- PR age at merge: elapsed time from PR opened to PR merged.
Most teams discover their average is far worse than they assumed. Developers think reviews happen
in a few hours. The data shows days.
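If your PRs live on GitHub, both numbers can be pulled straight from the API. The sketch below is a minimal example rather than a finished tool - OWNER, REPO, and TOKEN are placeholders, and GitLab or Bitbucket expose equivalent data through their own APIs:

```python
"""Rough review-latency report. A sketch assuming GitHub's REST API;
OWNER, REPO, and TOKEN are placeholders for your own values."""
from datetime import datetime

import requests

OWNER, REPO, TOKEN = "your-org", "your-repo", "ghp_your_token"  # placeholders
API = f"https://api.github.com/repos/{OWNER}/{REPO}"
HEADERS = {"Authorization": f"Bearer {TOKEN}", "Accept": "application/vnd.github+json"}


def hours_between(start: str, end: str) -> float:
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 3600


def report(last_n: int = 30) -> None:
    """Print time-to-first-review and age-at-merge for recently closed PRs."""
    prs = requests.get(f"{API}/pulls", headers=HEADERS,
                       params={"state": "closed", "per_page": last_n}).json()
    for pr in prs:
        if not pr.get("merged_at"):
            continue  # ignore PRs that were closed without merging
        reviews = requests.get(f"{API}/pulls/{pr['number']}/reviews", headers=HEADERS).json()
        first = min((r["submitted_at"] for r in reviews), default=None)
        ttfr = f"{hours_between(pr['created_at'], first):.1f}h" if first else "never reviewed"
        age = hours_between(pr["created_at"], pr["merged_at"])
        print(f"#{pr['number']}: first review after {ttfr}, merged after {age:.1f}h")


if __name__ == "__main__":
    report()
```

Run it weekly and post the two averages next to the team's review SLA so the trend stays visible.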
Step 2: Set a team review SLA (Week 1)
Agree as a team on a review turnaround target. A reasonable starting point:
- Reviews within 2 hours during working hours.
- PR age at merge under 24 hours.
Write this down as a working agreement. Post it on the board. This is not a suggestion - it is a
team commitment.
Step 3: Make reviews a first-class activity (Week 2)
The core behavior change: reviewing code is not something you do when you have spare time. It is
the highest-priority activity after your current task reaches a natural stopping point.
Concrete practices:
- Check for open PRs before starting new work. When a developer finishes a task or hits a
natural pause, their first action is to check for pending reviews, not to pull a new story.
- Auto-assign reviewers. Do not wait for someone to volunteer. Configure your tools to
assign a reviewer automatically when a PR is opened (see the sketch after this list).
- Rotate reviewers. Do not let one or two people carry all the review load. Any team member
should be able to review any PR. This spreads knowledge and distributes the work.
- Keep PRs small. Target under 200 lines of changed code. Small PRs get reviewed faster and
more thoroughly. If a developer says their PR is “too large to split,” that is a work
decomposition problem.
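For auto-assignment specifically, GitHub and GitLab both ship built-in mechanisms (CODEOWNERS files, round-robin team review assignment), and those are the first thing to reach for. If you want something custom, the idea fits in a few lines - the reviewer logins, OWNER, REPO, and TOKEN below are placeholders:

```python
"""Round-robin reviewer assignment for new PRs. A sketch assuming GitHub's
REST API; reviewer logins, OWNER, REPO, and TOKEN are placeholders."""
import requests

OWNER, REPO, TOKEN = "your-org", "your-repo", "ghp_your_token"  # placeholders
API = f"https://api.github.com/repos/{OWNER}/{REPO}"
HEADERS = {"Authorization": f"Bearer {TOKEN}", "Accept": "application/vnd.github+json"}
REVIEWERS = ["alice", "bob", "carol", "dana"]  # the whole team, not one or two seniors


def assign_reviewer(pr_number: int, author: str) -> str:
    """Pick a reviewer other than the author, deterministically by PR number."""
    candidates = [r for r in REVIEWERS if r != author]
    reviewer = candidates[pr_number % len(candidates)]
    requests.post(f"{API}/pulls/{pr_number}/requested_reviewers",
                  headers=HEADERS, json={"reviewers": [reviewer]})
    return reviewer
```

Trigger it from whatever reacts to "pull request opened" events in your setup, such as a webhook handler or a CI job.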
Step 4: Consider synchronous review (Week 3+)
The fastest review is one that happens in real time. If async review consistently exceeds the
team’s SLA, move toward synchronous alternatives:
| Method | How it works | Review wait time |
|---|---|---|
| Pair programming | Two developers write the code together. Review is continuous. | Zero |
| Over-the-shoulder | Author walks reviewer through the change on a call. | Minutes |
| Rapid async | PR opened, reviewer notified, review within 2 hours. | Under 2 hours |
| Traditional async | PR opened, reviewer gets to it when they can. | Hours to days |
Pair programming eliminates the review bottleneck entirely. The code is reviewed as it is written.
There is no PR, no queue, and no wait. For teams that struggle with review latency, pairing is
often the most effective solution.
Step 5: Address the objections
| Objection | Response |
|---|---|
| “I can’t drop what I’m doing to review” | You are not dropping everything. You are checking for reviews at natural stopping points: after a commit, after a test passes, after a meeting. Reviews that take 10 minutes should not require “dropping” anything. |
| “Reviews take too long because the PRs are too big” | Then the PRs need to be smaller. A 50-line change takes 5-10 minutes to review. The review is not the bottleneck - the PR size is. |
| “Only senior developers can review this code” | That is a knowledge silo. Rotate reviewers so that everyone builds familiarity with every part of the codebase. Junior developers reviewing senior code is learning. Senior developers reviewing junior code is mentoring. Both are valuable. |
| “We need two reviewers for compliance” | Check whether your compliance framework actually requires two human reviewers, or whether it requires two sets of eyes on the code. Pair programming satisfies most separation-of-duties requirements while eliminating review latency. |
| “We tried faster reviews and quality dropped” | Fast does not mean careless. Automate style checks so reviewers focus on logic, correctness, and design. Small PRs get better reviews than large ones regardless of speed. |
Measuring Progress
| Metric | What to look for |
|---|---|
| Time to first review | Should drop below 2 hours |
| PR age at merge | Should drop below 24 hours |
| Open PR count | Should stay low - ideally fewer than the number of team members |
| PR size (lines changed) | Should trend below 200 lines |
| Review rework cycles | Should stay under 2 rounds per PR |
| Development cycle time | Should decrease as review wait time drops |
1.2 - Work Items Too Large
Work items regularly take more than a week. Developers work on a single item for days without integrating.
Category: Team Workflow | Quality Impact: High
What This Looks Like
A developer picks up a work item on Monday. By Wednesday, they are still working on it. By Friday,
it is “almost done.” The following Monday, they are fixing edge cases. The item finally moves to
review mid-week - a 300-line pull request that the reviewer does not have time to look at
carefully.
Common variations:
- The week-long item. Work items routinely take five or more days. Developers work on a single
item for an entire sprint without integrating to trunk. The branch diverges further every day.
- The “it’s really just one thing” item. A ticket titled “Add user profile page” hides a
login form, avatar upload, email verification, notification preferences, and password reset.
It looks like one feature to the product owner. It is six features to the developer.
- The point-inflated item. The team estimates work at 8 or 13 points. Nobody questions
whether an 8-point item should be decomposed. High estimates are treated as a property of the
work rather than a signal that the work is too big.
- The “spike that became a feature.” A time-boxed investigation turns into an implementation.
The developer keeps going because they have momentum, and the result is a large, unreviewed
change that was never planned or decomposed.
- The horizontal slice. Work is split by technical layer: “build the database schema,”
“build the API,” “build the UI.” Each item takes days because it spans the entire layer.
Nothing is deployable until all three are done.
The telltale sign: look at the team’s cycle time distribution. If work items regularly take five
or more days from start to done, the items are too large.
Why This Is a Problem
Large work items are not just slow. They are a compounding force that makes every other part of
the delivery process worse.
They prevent daily integration
Trunk-based development requires integrating to trunk at least once per day. A work item that
takes a week to complete cannot be integrated daily unless it is decomposed into smaller pieces
that are each independently integrable. Most teams with large work items do not decompose them -
they work on a branch for the full duration and merge at the end.
This means a week of work is invisible to the rest of the team until it lands as a single large
merge. A week of assumptions goes untested against the real state of trunk. A week of potential
merge conflicts accumulates silently.
When work items are small enough to complete in one to two days, each item is a natural
integration point. The developer finishes the item, integrates to trunk, and the change is
tested, reviewed, and deployed before the next item begins.
They make estimation meaningless
Large work items hide unknowns. An item estimated at 8 points might take three days or three
weeks depending on what the developer discovers along the way. The estimate is a guess wrapped in
false precision.
This makes planning unreliable. The team commits to a set of large items, discovers mid-sprint
that one of them is twice as big as expected, and scrambles at the end. The retrospective
identifies “estimation accuracy” as the problem, but the real problem is that the items were too
big to estimate accurately in the first place.
Small work items are inherently more predictable. An item that takes one to two days has a narrow
range of uncertainty. Even if the estimate is off, it is off by hours, not weeks. Plans built
from small items are more reliable because the variance of each item is small.
They increase rework
A developer working on a large item makes dozens of decisions over several days: architectural
choices, naming conventions, error handling approaches, API contracts. These decisions are made in
isolation. Nobody sees them until the code review, which happens after all the work is done.
When the reviewer disagrees with a fundamental decision made on day one, the developer has built
five days of work on top of it. The rework cost is enormous. They either rewrite large portions
of the code or the team accepts a suboptimal decision because the cost of changing it is too high.
With small items, decisions surface quickly. A one-day item produces a small pull request that is
reviewed within hours. If the reviewer disagrees with an approach, the cost of changing it is a
few hours of work, not a week. Fundamental design problems are caught early, before layers of
code are built on top of them.
They hide risk until the end
A large work item carries risk that is invisible until late in its lifecycle. The developer might
discover on day four that the chosen approach does not work, that an API they depend on behaves
differently than documented, or that the database cannot handle the query pattern they assumed.
When this discovery happens on day four of a five-day item, the options are bad: rush a fix, cut
scope, or miss the sprint commitment. The team had no visibility into the risk because the work
was a single opaque block on the board.
Small items surface risk early. If the approach does not work, the team discovers it on day one
of a one-day item. The cost of changing direction is minimal. The risk is contained to a small
unit of work rather than spreading across an entire feature.
Impact on continuous delivery
Continuous delivery is built on small, frequent, low-risk changes flowing through the pipeline.
Large work items produce the opposite: infrequent, high-risk changes that batch up in branches
and land as large merges.
A team with five developers working on five large items has zero deployable changes for days at a
time. Then several large changes land at once, the pipeline is busy for hours, and conflicts
between the changes create unexpected failures. This is batch-and-queue delivery wearing agile
clothing.
The feedback loop is broken too. A small change deployed to production gives immediate signal:
does the change work? Does it affect performance? Do users behave as expected? A large change
deployed after a week gives noisy signal: something changed, but which of the fifty modifications
caused the issue?
How to Fix It
Step 1: Establish the 2-day rule (Week 1)
Agree as a team: no work item should take longer than two days from start to integrated on
trunk.
This is not a velocity target. It is a constraint on item size. If an item cannot be completed
in two days, it must be decomposed before it is pulled into the sprint.
Write this as a working agreement and enforce it during planning. When someone estimates an item
at more than two days, the response is “how do we split this?” - not “who can do it faster?”
Step 2: Learn vertical slicing (Week 2)
The most common decomposition mistake is horizontal slicing - splitting by technical layer instead
of by user-visible behavior. Train the team on vertical slicing:
Horizontal (avoid):
| Work item | Deployable? | Testable end-to-end? |
|---|---|---|
| Build the database schema for orders | No | No |
| Build the API for orders | No | No |
| Build the UI for orders | Only after all three are done | Only after all three are done |
Vertical (prefer):
| Work item | Deployable? | Testable end-to-end? |
|---|---|---|
| User can create a basic order (DB + API + UI) | Yes | Yes |
| User can add a discount to an order | Yes | Yes |
| User can view order history | Yes | Yes |
Each vertical slice cuts through all layers to deliver a thin piece of complete functionality.
Each is independently deployable and testable. Each gives feedback before the next slice begins.
Step 3: Use acceptance criteria as a splitting signal (Week 2+)
Count the acceptance criteria on each work item. If an item has more than three to five acceptance
criteria, it is probably too big. Each criterion or small group of criteria can become its own
item.
Write acceptance criteria in concrete Given-When-Then format. Each scenario is a natural
decomposition boundary:
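For example, a hypothetical “apply a discount to an order” item might break down into scenarios like:
- Given a valid discount code, when the user applies it at checkout, then the order total is reduced by the discount amount
- Given an expired discount code, when the user applies it at checkout, then the total is unchanged and the user sees an explanation
- Given an order with a discount already applied, when the user applies a second code, then the better of the two discounts is kept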
Each scenario can be implemented, integrated, and deployed independently.
Step 4: Decompose during refinement, not during the sprint (Week 3+)
Work items should arrive at planning already decomposed. If the team is splitting items
mid-sprint, refinement is not doing its job.
During backlog refinement:
- Product owner presents the feature or outcome.
- Team discusses the scope and writes acceptance criteria.
- If the item has more than three to five criteria, split it immediately.
- Each resulting item is estimated. Any item over two days is split again.
- Items enter the sprint already small enough to flow.
Step 5: Address the objections
| Objection | Response |
|---|---|
| “Splitting creates too many items to manage” | Small items are easier to manage, not harder. They have clear scope, predictable timelines, and simple reviews. The overhead per item should be near zero. If it is not, simplify your process. |
| “Some things can’t be done in two days” | Almost anything can be decomposed further. Database migrations can be done in backward-compatible steps. UI changes can be hidden behind feature flags. The skill is finding the decomposition, not deciding whether one exists. |
| “We’ll lose the big picture if we split too much” | The epic or feature still exists as an organizing concept. Small items are not independent fragments - they are ordered steps toward a defined outcome. Use an epic to track the overall feature and individual items to track the increments. |
| “Product doesn’t want partial features” | Feature flags let you deploy incomplete features without exposing them to users. The code is integrated and tested continuously, but the user-facing feature is toggled on only when all slices are done. |
| “Our estimates are fine, items just take longer than expected” | That is the definition of items being too big. Small items have narrow estimation variance. If a one-day item takes two days, you are off by a day. If a five-day item takes ten, you have lost a sprint. |
Measuring Progress
| Metric | What to look for |
|---|---|
| Item cycle time | Should be two days or less from start to trunk |
| Development cycle time | Should decrease as items get smaller |
| Items completed per week | Should increase even if total output stays the same |
| Integration frequency | Should increase as developers integrate completed items daily |
| Items that exceed the 2-day rule | Track violations and discuss in retrospectives |
| Work in progress | Should decrease as smaller items flow through faster |
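Integration frequency is easy to approximate from version control. A minimal sketch, assuming a local clone and that trunk is named main:

```python
"""Count integrations to trunk per day. A sketch assuming trunk is `main`
and the repository is cloned locally with up-to-date history."""
import subprocess
from collections import Counter


def integrations_per_day(days: int = 14) -> Counter:
    # --first-parent counts each merge (or direct commit) to trunk exactly once
    dates = subprocess.run(
        ["git", "log", "--first-parent", "main", f"--since={days} days ago",
         "--pretty=%ad", "--date=short"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    return Counter(dates)


if __name__ == "__main__":
    for day, count in sorted(integrations_per_day().items()):
        print(f"{day}: {count} integrations")
```

For a team of five, fewer than five entries per working day is a signal that completed items are not reaching trunk daily.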
1.3 - No Vertical Slicing
Work is organized by technical layer - “build the API,” “build the UI” - rather than by user-visible behavior. Nothing is deployable until all layers are done.
Category: Team Workflow | Quality Impact: Medium
What This Looks Like
The team breaks a feature into work items by architectural layer. One item for the database
schema. One for the API. One for the frontend. Maybe one for “integration testing” at the end.
Each item lives in a different lane or is assigned to a different specialist. Nothing reaches
production until the last layer is finished and all the pieces are stitched together.
Common variations:
- Layer-based assignment. “The backend team builds the API, the frontend team builds the UI.”
Each team delivers their layer independently. Integration is a separate phase that happens after
both teams are “done.”
- The database-first approach. Every feature starts with “build the schema.” Weeks of database
work happen before any API or UI exists. The schema is designed for the complete feature rather
than for the first thin slice.
- The API-then-UI pattern. The API is built and “tested” in isolation with Postman or curl.
The UI is built weeks later against the API. Mismatches between what the API provides and what
the UI needs are discovered at the end.
- The “integration sprint.” After the layers are built separately, the team dedicates a sprint
to wiring everything together. This sprint always takes longer than planned because the layers
were built on different assumptions.
- Technical stories on the board. The backlog contains items like “create database indexes,”
“add caching layer,” or “refactor service class.” None of these deliver user-visible value. They
are infrastructure work that has been separated from the feature it supports.
The telltale sign: ask “can we deploy this work item to production and have a user see something
different?” If the answer is no, the work is sliced horizontally.
Why This Is a Problem
Horizontal slicing feels natural to developers because it matches how they think about the
system’s architecture. But it optimizes for how the code is organized, not for how value is
delivered. The consequences compound across every dimension of delivery.
Nothing is deployable until everything is done
A horizontal slice delivers no user-visible value on its own. The database schema alone does
nothing. The API alone does nothing a user can see. The UI alone has no data to display. Value
only emerges when all layers are assembled - and that assembly happens at the end.
This means the team has zero deployable output for the entire duration of the feature build. A
feature that takes three sprints to build across layers produces three sprints of work in progress
and zero deliverables. The team is busy the entire time, but nothing reaches production.
With vertical slicing, every item is deployable. The first slice might be “user can create a
basic order” - thin, but it touches the database, API, and UI. It can be deployed to production
behind a feature flag on day two. Feedback starts immediately. The remaining slices build on a
working foundation rather than converging on an untested one.
Integration risk accumulates invisibly
When layers are built separately, each team or developer makes assumptions about how their layer
will connect to the others. The backend developer assumes the API contract looks a certain way.
The frontend developer assumes the response format matches their component design. The database
developer assumes the query patterns align with how the API will call them.
These assumptions are untested until integration. The longer the layers are built in isolation,
the more assumptions accumulate and the more likely they are to conflict. Integration becomes the
riskiest phase of the project - the phase where all the hidden mismatches surface at once.
With vertical slicing, integration happens with every item. The first slice forces the developer
to connect all the layers immediately. Assumptions are tested on day one, not month three.
Subsequent slices extend a working, integrated system rather than building isolated components
that have never talked to each other.
Feedback is delayed until it is expensive to act on
A horizontal approach delays user feedback until the full feature is assembled. If the team builds
the wrong thing - misunderstands a requirement, makes a poor UX decision, or solves the wrong
problem - they discover it after weeks of work across multiple layers.
At that point, the cost of changing direction is enormous. The database schema, API contracts, and
UI components all need to be reworked. The team has already invested heavily in an approach that
turns out to be wrong.
Vertical slicing delivers feedback with every increment. The first slice ships a thin version of
the feature that real users can see. If the approach is wrong, the team discovers it after a day
or two of work, not after a month. The cost of changing direction is the cost of one small item,
not the cost of an entire feature.
It creates specialist dependencies and handoff delays
Horizontal slicing naturally leads to specialist assignment: the database expert takes the
database item, the API expert takes the API item, the frontend expert takes the frontend item.
Each person works in isolation on their layer, and the work items have dependencies between them -
the API cannot be built until the schema exists, the UI cannot be built until the API exists.
These dependencies create sequential handoffs. The database work finishes, but the API developer
is busy with something else. The API work finishes, but the frontend developer is mid-sprint on
a different feature. Each handoff introduces wait time that has nothing to do with the complexity
of the work.
Vertical slicing eliminates these dependencies. A single developer (or pair) implements the full
slice across all layers. There are no handoffs between layers because one person owns the entire
thin slice from database to UI. This also spreads knowledge - developers who work across all
layers understand the full system, not just their specialty.
Impact on continuous delivery
Continuous delivery requires a continuous flow of small, independently deployable changes.
Horizontal slicing produces the opposite: a batch of interdependent layer changes that can only
be deployed together after a separate integration phase.
A team that slices horizontally cannot deploy continuously because there is nothing to deploy
until all layers converge. They cannot get production feedback because nothing user-visible exists
until the end. They cannot limit risk because the first real test of the integrated system happens
after all the work is done.
The pipeline itself becomes less useful. When changes are horizontal slices, the pipeline can only
verify that one layer works in isolation - it cannot run meaningful end-to-end tests until all
layers exist. The pipeline gives a false green signal (“the API tests pass”) that hides the real
question (“does the feature work?”).
How to Fix It
Step 1: Learn to recognize horizontal slices (Week 1)
Before changing how the team slices, build awareness. Review the current sprint board and backlog.
For each work item, ask:
- Can a user (or another system) observe the change after this item is deployed?
- Can I write an end-to-end test for this item alone?
- Does this item deliver value without waiting for other items to be completed?
If the answer to any of these is no, the item is likely a horizontal slice. Tag these items and
count them. Most teams discover that a majority of their backlog is horizontally sliced.
Step 2: Reslice one feature vertically (Week 2)
Pick one upcoming feature and practice reslicing it. Start with the current horizontal breakdown
and convert it:
Before (horizontal):
- Create the database tables for notifications
- Build the notification API endpoints
- Build the notification preferences UI
- Integration testing for notifications
After (vertical):
- User receives an email notification when their order ships (DB + API + email + minimal UI)
- User can view notification history on their profile page
- User can disable email notifications for order updates
- User can choose between email and SMS for shipping notifications
Each vertical slice is independently deployable and testable end-to-end. Each delivers something
a user can see. The team gets feedback after item 1 instead of after item 4.
Step 3: Use the deployability test in refinement (Week 3+)
Make the deployability test a standard part of backlog refinement. For every proposed work item,
ask: “If this were the only thing we shipped this sprint, would a user notice?”
If not, the item needs reslicing. This single question catches most horizontal slices before they
enter the sprint.
Complement this with concrete acceptance criteria in Given-When-Then format. Each scenario should
describe observable behavior, not technical implementation:
- Good: “Given a registered user, when they update their email, then a verification link is sent
to the new address”
- Bad: “Build the email verification API endpoint”
Step 4: Break the specialist habit (Week 4+)
Horizontal slicing and specialist assignment reinforce each other. As long as “the backend
developer does the backend work,” slicing by layer feels natural.
Break this cycle:
- Have developers work full-stack on vertical slices. A developer who implements the entire
slice - database, API, and UI - will naturally slice vertically because they own the full
delivery.
- Pair a specialist with a generalist. If a developer is uncomfortable with a particular
layer, pair them with someone who knows it. This builds cross-layer skills while delivering
vertical slices.
- Rotate who works on what. Do not let the same person always take the database items. When
anyone can work on any layer, the team stops organizing work by layer.
Step 5: Address the objections
| Objection | Response |
|---|---|
| “Our developers are specialists - they can’t work across layers” | That is a skill gap, not a constraint. Pairing a frontend developer with a backend developer on a vertical slice builds the missing skills while delivering the work. The short-term slowdown produces long-term flexibility. |
| “The database schema needs to be designed holistically” | Design the schema incrementally. Add the columns and tables needed for the first slice. Extend them for the second. This is how trunk-based database evolution works - backward-compatible, incremental changes. Designing the “complete” schema upfront leads to speculative design that changes anyway. |
| “Vertical slices create duplicate work across layers” | They create less total work because integration problems are caught immediately instead of accumulating. The “duplicate” concern usually means the team is building more infrastructure than the current slice requires. Build only what the current slice needs. |
| “Some work is genuinely infrastructure” | True infrastructure work (setting up a new database, provisioning a service) still needs to be connected to a user outcome. “Provision the notification service and send one test notification” is a vertical slice that includes the infrastructure. |
| “Our architecture makes vertical slicing hard” | That is a signal about the architecture. Tightly coupled layers that cannot be changed independently are a deployment risk. Vertical slicing exposes this coupling early, which is better than discovering it during a high-stakes integration phase. |
Measuring Progress
| Metric | What to look for |
|---|---|
| Percentage of work items that are independently deployable | Should increase toward 100% |
| Time from feature start to first production deploy | Should decrease as the first vertical slice ships early |
| Development cycle time | Should decrease as items no longer wait for other layers |
| Integration issues discovered late | Should decrease as integration happens with every slice |
| Integration frequency | Should increase as deployable slices are completed and merged daily |
1.4 - Too Much Work in Progress
Every developer is on a different story. Eight items in progress, zero done. Nothing gets the focused attention needed to finish.
Category: Team Workflow | Quality Impact: High
What This Looks Like
Open the team’s board on any given day. Count the items in progress. Now count the team members.
If the first number is significantly higher than the second, the team has a WIP problem.
Common variations:
- Everyone on a different story. A team of five has eight or more stories in progress. Nobody
is working on the same thing. The board is a wide river of half-finished work.
- Sprint-start explosion. On the first day of the sprint, every developer pulls a story. By
mid-sprint, all stories are “in progress” and none are “done.” The last day is a scramble to
close anything.
- Individual WIP hoarding. A single developer has three stories assigned: one they’re actively
coding, one waiting for review, and one blocked on a question. They count all three as “in
progress” and start nothing new - but they also don’t help anyone else finish.
- Hidden WIP. The board shows five items in progress, but each developer is also investigating
a production bug, answering questions about a previous story, and prototyping something for next
sprint. Unofficial work doesn’t appear on the board but consumes the same attention.
- Expedite as default. Urgent requests arrive mid-sprint. Instead of replacing existing work,
they stack on top. WIP grows because nothing is removed when something is added.
The telltale sign: the team is busy all the time but finishes very little. Stories take longer and
longer to complete. The sprint ends with a pile of items at 80% done.
Why This Is a Problem
High WIP is not a sign of a productive team. It is a sign of a team that has optimized for
starting work instead of finishing it. The consequences compound over time.
It destroys focus and increases context switching
Every item in progress competes for a developer’s attention. A developer working on one story can
focus deeply. A developer juggling three stories - one active, one waiting for review, one they
need to answer questions about - is constantly switching context. Research consistently shows that
each additional concurrent task reduces productive time by 20-40%.
The switching cost is not just time. It is cognitive load. Developers lose their mental model of
the code when they switch away, and it takes 15-30 minutes to rebuild it when they switch back.
Multiply this across five context switches per day and the team is spending more time reloading
context than writing code.
In a low-WIP environment, developers finish one thing before starting the next. Deep focus is the
default. Context switching is the exception, not the rule.
It inflates cycle time
Little’s Law is not a suggestion. It is a mathematical relationship: cycle time equals work in
progress divided by throughput. If a team’s throughput is roughly constant (and over weeks, it is),
the only way to reduce cycle time is to reduce WIP.
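Written as a formula:

```latex
\text{average cycle time} \;=\; \frac{\text{work in progress}}{\text{throughput}}
```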
A team of five with a throughput of ten stories per sprint and five stories in progress has an
average cycle time of half a sprint. The same team with fifteen stories in progress has an average
cycle time of 1.5 sprints. The work is not getting done faster because more of it was started. It
is getting done slower because all of it is competing for the same capacity.
Long cycle times create their own problems. Feedback is delayed. Requirements go stale.
Integration conflicts accumulate. The longer a story sits in progress, the more likely it is to
need rework when it finally reaches review or testing.
It hides bottlenecks
When WIP is high, bottlenecks are invisible. If code reviews are slow, a developer just starts
another story while they wait. If the test environment is broken, they work on something else. The
constraint is never confronted because there is always more work to absorb the slack.
This is comfortable but destructive. The bottleneck does not go away because the team is working
around it. It quietly degrades the system. Reviews pile up. Test environments stay broken. The
team’s real throughput is constrained by the bottleneck, but nobody feels the pain because they
are always busy.
When WIP is limited, bottlenecks become immediately visible. A developer who cannot start new work
because the WIP limit is reached has to swarm on something blocked. “I’m idle because my PR has
been waiting for review for two hours” is a problem the team can solve. “I just started another
story while I wait” hides the same problem indefinitely.
It prevents swarming and collaboration
When every developer has their own work in progress, there is no incentive to help anyone else.
Reviewing a teammate’s pull request, pairing on a stuck story, or helping debug a failing test all
feel like distractions from “my work.” The result is that every item moves through the pipeline
alone, at the pace of a single developer.
Swarming - multiple team members working together to finish the highest-priority item - is
impossible when everyone has their own stories to protect. If you ask a developer to drop their
current story and help finish someone else’s, you are asking them to fall behind on their own work.
The incentive structure is broken.
In a low-WIP environment, finishing the team’s most important item is everyone’s job. When only
three items are in progress for a team of five, two people are available to pair, review, or
unblock. Collaboration is the natural state, not a special request.
Impact on continuous delivery
Continuous delivery requires a steady flow of small, finished changes moving through the pipeline.
High WIP produces the opposite: a large batch of unfinished changes sitting in various stages of
completion, blocking each other, accumulating merge conflicts, and stalling in review queues.
A team with fifteen items in progress does not deploy fifteen times as often as a team with one
item in progress. They deploy less frequently because nothing is fully done. Each “almost done”
story is a small batch that has not yet reached the pipeline. The batch keeps growing until
something forces a reckoning - usually the end of the sprint.
The feedback loop breaks too. When changes sit in progress for days, the developer who wrote the
code has moved on by the time the review comes back or the test fails. They have to reload context
to address feedback, which takes more time, which delays the next change, which increases WIP
further. The cycle reinforces itself.
How to Fix It
Step 1: Make WIP visible (Week 1)
Before setting any limits, make the current state impossible to ignore.
- Count every item currently in progress for the team. Include stories, bugs, spikes, and any
unofficial work that is consuming attention.
- Write this number on the board. Update it daily.
- Most teams are shocked. A team of five typically discovers 12-20 items in progress once hidden
work is included.
Do not try to fix anything yet. The goal is awareness.
Step 2: Set an initial WIP limit (Week 2)
Use the N+2 formula as a starting point, where N is the number of team members actively
working on delivery.
| Team size | Starting WIP limit | Why |
|---|---|---|
| 3 developers | 5 items | One per person plus a buffer for blocked items |
| 5 developers | 7 items | Same N+2 formula |
| 8 developers | 10 items | Buffer shrinks proportionally |
Add the limit to the board as a column header or policy: “In Progress (limit: 7).” Agree as a
team that when the limit is reached, nobody starts new work.
Step 3: Enforce the limit with swarming (Week 3+)
When the WIP limit is reached and a developer finishes something, they have two options:
- Pull the next highest-priority item if the WIP count is below the limit.
- Swarm on an existing item if the WIP count is at the limit.
Swarming means pairing on a stuck story, reviewing a pull request, writing a test someone needs
help with, or resolving a blocker. The key behavior change: “I have nothing to do” is never the
right response. “What can I help finish?” is.
Step 4: Lower the limit over time (Monthly)
The initial limit is a starting point. Each month, consider reducing it by one.
| Limit | What it exposes |
|---|---|
| N+2 | Gross overcommitment. Most teams find this is already a significant reduction. |
| N+1 | Slow reviews, environment contention, unclear requirements. Team starts swarming. |
| N | Every person on one thing. Blocked items get immediate attention. |
| Below N | Team is pairing by default. Cycle time drops sharply. |
Each reduction will feel uncomfortable. That discomfort is the point - it exposes constraints in
the workflow that were hidden by excess WIP.
Step 5: Address the objections
Expect resistance and prepare for it:
| Objection | Response |
|---|---|
| “I’ll be idle if I can’t start new work” | Idle hands are not the problem. Idle work is. Help finish something instead of starting something new. |
| “Management will see people not typing and think we’re wasting time” | Track cycle time and throughput. When both improve, the data speaks for itself. |
| “We have too many priorities to limit WIP” | Having many priorities is exactly why you need a WIP limit. Without one, nothing gets the focus needed to finish. Everything is “in progress,” nothing is done. |
| “What about urgent production issues?” | Keep one expedite slot. If a production issue arrives, it takes the slot. If the slot is full, the new issue replaces the current one. Expedite is not a way to bypass the limit - it is part of the limit. |
| “Our stories are too big to pair on” | That is a separate problem. See Work Decomposition. Stories should be small enough that anyone can pick them up. |
Measuring Progress
| Metric | What to look for |
|---|---|
| Work in progress | Should stay at or below the team’s limit |
| Development cycle time | Should decrease as WIP drops |
| Stories completed per week | Should stabilize or increase despite starting fewer items |
| Time items spend blocked | Should decrease as the team swarms on blockers |
| Sprint-end scramble | Should disappear as work finishes continuously through the sprint |
1.5 - Push-Based Work Assignment
Work is assigned to individuals by a manager or lead instead of team members pulling the next highest-priority item.
Category: Team Workflow | Quality Impact: High
What This Looks Like
A manager, tech lead, or project manager decides who works on what. Assignments happen during
sprint planning, in one-on-ones, or through tickets pre-assigned before the sprint starts. Each
team member has “their” stories for the sprint. The assignment is rarely questioned.
Common variations:
- Assignment by specialty. “You’re the database person, so you take the database stories.” Work
is routed by perceived expertise rather than team priority.
- Assignment by availability. A manager looks at who is “free” and assigns the next item from
the backlog, regardless of what the team needs finished.
- Assignment by seniority. Senior developers get the interesting or high-priority work. Junior
developers get what’s left.
- Pre-loaded sprints. Every team member enters the sprint with their work already assigned. The
sprint board is fully allocated on day one.
The telltale sign: if you ask a developer “what should you work on next?” and the answer is “I
don’t know, I need to ask my manager,” work is being pushed.
Why This Is a Problem
Push-based assignment is one of the most quietly destructive practices a team can have. It
undermines nearly every CD practice by breaking the connection between the team and the flow of
work. Each of its effects compounds the others.
It reduces quality
Push assignment makes code review feel like a distraction from “my stories.” When every developer
has their own assigned work, reviewing someone else’s pull request is time spent not making progress
on your own assignment. Reviews sit for hours or days because the reviewer is busy with their own
work. The same dynamic discourages pairing: spending an hour helping a colleague means falling
behind on your own assignments, so developers don’t offer and don’t ask.
This means fewer eyes on every change. Defects that a second person would catch in minutes survive
into production. Knowledge stays siloed because there is no reason to look at code outside your
assignment. The team’s collective understanding of the codebase narrows over time.
In a pull system, reviewing code and unblocking teammates are the highest-priority activities
because finishing the team’s work is everyone’s work. Reviews happen quickly because they are not
competing with “my stories” - they are the work. Pairing happens naturally because anyone might
pick up any story, and asking for help is how the team moves its highest-priority item forward.
It increases rework
Push assignment routes work by specialty: “You’re the database person, so you take the database
stories.” This creates knowledge silos where only one person understands a part of the system.
When the same person always works on the same area, mistakes go unreviewed by anyone with a fresh
perspective. Assumptions go unchallenged because the reviewer lacks context to question them.
Misinterpretation of requirements also increases. The assigned developer may not have context on why
a story is high priority or what business outcome it serves - they received it as an assignment, not
as a problem to solve. When the result doesn’t match what was needed, the story comes back for
rework.
In a pull system, anyone might pick up any story, so knowledge spreads across the team. Fresh eyes
catch assumptions that a domain expert would miss. Developers who pull a story engage with its
priority and purpose because they chose it from the top of the backlog. Rework drops because more
perspectives are involved earlier.
It makes delivery timelines unpredictable
Push assignment optimizes for utilization - keeping everyone busy - not for flow - getting things
done. Every developer has their own assigned work, so team WIP is the sum of all individual
assignments. There is no mechanism to say “we have too much in progress, let’s finish something
first.” WIP limits become meaningless when the person assigning work doesn’t see the full picture.
Bottlenecks are invisible because the manager assigns around them instead of surfacing them. If one
area of the system is a constraint, the assigner may not notice because they are looking at people,
not flow. In a pull system, the bottleneck becomes obvious: work piles up in one column and nobody
pulls it because the downstream step is full.
Workloads are uneven because managers cannot perfectly predict how long work will take. Some people
finish early and sit idle or start low-priority work, while others are overloaded. Feedback loops
are slow because the order of work is decided at sprint planning; if priorities change mid-sprint,
the manager must reassign. Throughput becomes erratic - some sprints deliver a lot, others very
little, with no clear pattern.
In a pull system, workloads self-balance: whoever finishes first pulls the next item. Bottlenecks
are visible. WIP limits actually work because the team collectively decides what to start. The team
automatically adapts to priority changes because the next person who finishes simply pulls whatever
is now most important.
It removes team ownership
Pull systems create shared ownership of the backlog. The team collectively cares about the priority
order because they are collectively responsible for finishing work. Push systems create individual
ownership: “that’s not my story.” When a developer finishes their assigned work, they wait for more
assignments instead of looking at what the team needs.
This extends beyond task selection. In a push system, developers stop thinking about the team’s
goals and start thinking about their own assignments. Swarming - multiple people collaborating to
finish the highest-priority item - is impossible when everyone “has their own stuff.” If a story is
stuck, the assigned developer struggles alone while teammates work on their own assignments.
The unavailability problem makes this worse. When each person works in isolation on “their” stories,
the rest of the team has no context on what that person is doing, how the work is structured, or
what decisions have been made. If the assigned person is out sick, on vacation, or leaves the
company, nobody can pick up where they left off. The work either stalls until that person returns or
another developer starts over - rereading requirements, reverse-engineering half-finished code, and
rediscovering decisions that were never shared. In a pull system, the team maintains context on
in-progress work because anyone might have pulled it, standups focus on the work rather than
individual status, and pairing spreads knowledge continuously. When someone is unavailable, the
next person simply picks up the item with enough shared context to continue.
Impact on continuous delivery
Continuous delivery depends on a steady, predictable flow of small changes through the pipeline.
Push-based assignment produces the opposite: batch-based assignment at sprint planning, uneven
bursts of activity as different developers finish at different times, blocked work sitting idle
because the assigned person is busy with something else, and no team-level mechanism for optimizing
throughput. You cannot build a continuous flow of work when the assignment model is batch-based and
individually scoped.
How to Fix It
Step 1: Order the backlog by priority (Week 1)
Before switching to a pull model, the backlog must have a clear priority order. Without it,
developers will not know what to pull next.
- Work with the product owner to stack-rank the backlog. Every item has a unique position - no
tied priorities.
- Make the priority visible. The top of the board or backlog is the most important item. There
is no ambiguity.
- Agree as a team: when you need work, you pull from the top.
Step 2: Stop pre-assigning work in sprint planning (Week 2)
Change the sprint planning conversation. Instead of “who takes this story,” the team:
- Pulls items from the top of the prioritized backlog into the sprint.
- Discusses each item enough for anyone on the team to start it.
- Leaves all items unassigned.
The sprint begins with a list of prioritized work and no assignments. This will feel uncomfortable
for the first sprint.
Step 3: Pull work daily (Week 2+)
At the daily standup (or anytime during the day), a developer who needs work:
- Looks at the sprint board.
- Checks if any in-progress item needs help (swarm first, pull second).
- If nothing needs help and the WIP limit allows, pulls the top unassigned item and assigns
themselves.
The developer picks up the highest-priority available item, not the item that matches their
specialty. This is intentional - it spreads knowledge, reduces bus factor, and keeps the team
focused on priority rather than comfort.
Step 4: Address the discomfort (Weeks 3-4)
Expect these objections and plan for them:
| Objection | Response |
|---|---|
| “But only Sarah knows the payment system” | That is a knowledge silo and a risk. Pairing Sarah with someone else on payment stories fixes the silo while delivering the work. |
| “I assigned work because nobody was pulling it” | If nobody pulls high-priority work, that is a signal: either the team doesn’t understand the priority, the item is poorly defined, or there is a skill gap. Assignment hides the signal instead of addressing it. |
| “Some developers are faster - I need to assign strategically” | Pull systems self-balance. Faster developers pull more items. Slower developers finish fewer but are never overloaded. The team throughput optimizes naturally. |
| “Management expects me to know who’s working on what” | The board shows who is working on what in real time. Pull systems provide more visibility than pre-assignment because assignments are always current, not a stale plan from sprint planning. |
Step 5: Combine with WIP limits (Week 4+)
Pull-based work and WIP limits reinforce each other:
- WIP limits prevent the team from pulling too much work at once.
- Pull-based assignment ensures that when someone finishes, they pull the next priority - not
whatever the manager thinks of next.
- Together, they create a system where work flows continuously from backlog to done.
See Limiting WIP for how to set and enforce WIP limits.
What managers do instead
Moving to a pull model does not eliminate the need for leadership. It changes the focus:
| Push model (before) | Pull model (after) |
|---|---|
| Decide who works on what | Ensure the backlog is prioritized and refined |
| Balance workloads manually | Coach the team on swarming and collaboration |
| Track individual assignments | Track flow metrics (cycle time, WIP, throughput) |
| Reassign work when priorities change | Update backlog priority and let the team adapt |
| Manage individual utilization | Remove systemic blockers the team cannot resolve |
Measuring Progress
| Metric | What to look for |
|---|---|
| Percentage of stories pre-assigned at sprint start | Should drop to near zero |
| Work in progress | Should decrease as team focuses on finishing |
| Development cycle time | Should decrease as swarming increases |
| Stories completed per sprint | Should stabilize or increase despite less “busyness” |
| Rework rate | Stories returned for rework or reopened after completion - should decrease |
| Knowledge distribution | Track who works on which parts of the system - should broaden over time |
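Knowledge distribution does not need a survey - version control already records it. A minimal sketch that counts distinct authors per top-level directory, assuming a local clone with trunk named main:

```python
"""Approximate knowledge distribution: distinct authors per top-level directory.
A sketch assuming a local clone with trunk named `main`."""
import subprocess
from collections import defaultdict


def authors_per_area(days: int = 90) -> dict[str, set[str]]:
    lines = subprocess.run(
        ["git", "log", "main", f"--since={days} days ago",
         "--name-only", "--pretty=AUTHOR:%an"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    areas: dict[str, set[str]] = defaultdict(set)
    author = None
    for line in lines:
        if line.startswith("AUTHOR:"):
            author = line[len("AUTHOR:"):]
        elif line and author:
            areas[line.split("/")[0]].add(author)  # group files by top-level directory
    return areas


if __name__ == "__main__":
    for area, authors in sorted(authors_per_area().items()):
        print(f"{area}: {len(authors)} contributors in the last 90 days")
```

Areas with a single contributor are the knowledge silos that push-based assignment tends to create.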
2 - Branching and Integration
Anti-patterns in how teams branch, merge, and integrate code that prevent continuous integration and delivery.
These anti-patterns affect how code flows from a developer’s machine to the shared trunk. They
create painful merges, delayed integration, and broken builds that prevent the steady stream of
small, verified changes that continuous delivery requires.
2.1 - Long-Lived Feature Branches
Branches that live for weeks or months, turning merging into a project in itself. The longer the branch, the bigger the risk.
Category: Branching & Integration | Quality Impact: Critical
What This Looks Like
A developer creates a branch to build a feature. The feature is bigger than expected. Days pass,
then weeks. Other developers are doing the same thing on their own branches. Trunk moves forward
while each branch diverges further from it. Nobody integrates until the feature is “done” - and
by then, the branch is hundreds or thousands of lines different from where it started.
When the merge finally happens, it is an event. The developer sets aside half a day - sometimes
more - to resolve conflicts, re-test, and fix the subtle breakages that come from combining weeks
of divergent work. Other developers delay their merges to avoid the chaos. The team’s Slack channel
lights up with “don’t merge right now, I’m resolving conflicts.” Every merge creates a window where
trunk is unstable.
Common variations:
- The “feature branch” that is really a project. A branch named feature/new-checkout that lasts three months. Multiple developers commit to it. It has its own bug fixes and its own merge conflicts. It is a parallel fork of the product.
- The “I’ll merge when it’s ready” branch. The developer views the branch as a private workspace.
Merging to trunk is the last step, not a daily practice. The branch falls further behind each day
but the developer does not notice until merge day.
- The per-sprint branch. Each sprint gets a branch. All sprint work goes there. The branch is
merged at sprint end and a new one is created. Integration happens every two weeks instead of
every day.
- The release isolation branch. A branch is created weeks before a release to “stabilize” it.
Bug fixes must be applied to both the release branch and trunk. Developers maintain two streams
of work simultaneously.
- The “too risky to merge” branch. The branch has diverged so far that nobody wants to attempt
the merge. It sits for weeks while the team debates how to proceed. Sometimes it is abandoned
entirely and the work is restarted.
The telltale sign: if merging a branch requires scheduling a block of time, notifying the team, or
hoping nothing goes wrong - branches are living too long.
Why This Is a Problem
Long-lived feature branches appear safe. Each developer works in isolation, free from interference.
But that isolation is precisely the problem. It delays integration, hides conflicts, and creates
compounding risk that makes every aspect of delivery harder.
It reduces quality
When a branch lives for weeks, code review becomes a formidable task. The reviewer faces hundreds
of changed lines across dozens of files. Meaningful review is nearly impossible at that scale -
studies consistently show that review effectiveness drops sharply after 200-400 lines of change.
Reviewers skim, approve, and hope for the best. Subtle bugs, design problems, and missed edge
cases survive because nobody can hold the full changeset in their head.
The isolation also means developers make decisions in a vacuum. Two developers on separate branches
may solve the same problem differently, introduce duplicate abstractions, or make contradictory
assumptions about shared code. These conflicts are invisible until merge time, when they surface as
bugs rather than design discussions.
With short-lived branches or trunk-based development, changes are small enough for genuine review.
A 50-line change gets careful attention. Design disagreements surface within hours, not weeks. The
team maintains a shared understanding of how the codebase is evolving because they see every change
as it happens.
It increases rework
Long-lived branches guarantee merge conflicts. Two developers editing the same file on different
branches will not discover the collision until one of them merges. The second developer must then
reconcile their changes against an unfamiliar modification, often without understanding the intent
behind it. This manual reconciliation is rework in its purest form - effort spent making code work
together that would have been unnecessary if the developers had integrated daily.
The rework compounds. A developer who rebases a three-week branch against trunk may introduce
bugs during conflict resolution. Those bugs require debugging. The debugging reveals an assumption
that was valid three weeks ago but is no longer true because trunk has changed. Now the developer
must rethink and partially rewrite their approach. What should have been a day of work becomes a
week.
When developers integrate daily, conflicts are small - typically a few lines. They are resolved in
minutes with full context because both changes are fresh. The cost of integration stays constant
rather than growing exponentially with branch age.
It makes delivery timelines unpredictable
A two-day feature on a long-lived branch takes two days to build and an unknown number of days
to merge. The merge might take an hour. It might take two days. It might surface a design conflict
that requires reworking the feature. Nobody knows until they try. This makes it impossible to
predict when work will actually be done.
The queuing effect makes it worse. When several branches need to merge, they form a queue. The
first merge changes trunk, which means the second branch needs to rebase against the new trunk
before merging. If the second merge is large, it changes trunk again, and the third branch must
rebase. Each merge invalidates the work done to prepare the next one. Teams that “schedule” their
merges are admitting that integration is so costly it needs coordination.
Project managers learn they cannot trust estimates. “The feature is code-complete” does not mean
it is done - it means the merge has not started yet. Stakeholders lose confidence in the team’s
ability to deliver on time because “done” and “deployed” are separated by an unpredictable gap.
With continuous integration, there is no merge queue. Each developer integrates small changes
throughout the day. The time from “code-complete” to “integrated and tested” is minutes, not days.
Delivery dates become predictable because the integration cost is near zero.
It hides risk until the worst possible moment
Long-lived branches create an illusion of progress. The team has five features “in development,”
each on its own branch. The features appear to be independent and on track. But the risk is
hidden: none of these features have been proven to work together. The branches may contain
conflicting changes, incompatible assumptions, or integration bugs that only surface when combined.
All of that hidden risk materializes at merge time - the moment closest to the planned release
date, when the team has the least time to deal with it. A merge conflict discovered three weeks
before release is an inconvenience. A merge conflict discovered the day before release is a crisis.
Long-lived branches systematically push risk discovery to the latest possible point.
Continuous integration surfaces risk immediately. If two changes conflict, the team discovers it
within hours, while both changes are small and the authors still have full context. Risk is
distributed evenly across the development cycle instead of concentrated at the end.
Impact on continuous delivery
Continuous delivery requires that trunk is always in a deployable state and that any commit can be
released at any time. Long-lived feature branches make both impossible. Trunk cannot be deployable
if large, poorly validated merges land periodically and destabilize it. You cannot release any commit
if the latest commit is a 2,000-line merge that has not been fully tested.
Long-lived branches also prevent continuous integration - the practice of integrating every
developer’s work into trunk at least once per day. Without continuous integration, there is no
continuous delivery. The pipeline cannot provide fast feedback on changes that exist only on
private branches. The team cannot practice deploying small changes because there are no small
changes - only large merges separated by days or weeks of silence.
Every other CD practice - automated testing, pipeline automation, small batches, fast feedback -
is undermined when the branching model prevents frequent integration.
How to Fix It
Step 1: Measure your current branch lifetimes (Week 1)
Before changing anything, understand the baseline. For every open branch:
- Record when it was created and when (or if) it was last merged.
- Calculate the age in days.
- Note the number of changed files and lines.
Most teams are shocked by their own numbers. A branch they think of as “a few days old” is often
two or three weeks old. Making the data visible creates urgency.
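If you want this report straight from git, here is a minimal Python sketch. It assumes the git CLI is on the path and that origin/main is the trunk, and it defines a branch’s age as the age of its oldest commit not yet on trunk - adjust the ref names for your repository.

```python
# Report how long each remote branch has been diverging from trunk.
import subprocess
import time

def git(*args: str) -> str:
    return subprocess.run(["git", *args], capture_output=True,
                          text=True, check=True).stdout.strip()

def branch_ages(trunk: str = "origin/main") -> list[tuple[str, float]]:
    branches = git("for-each-ref", "--format=%(refname:short)",
                   "refs/remotes/origin").splitlines()
    now = time.time()
    report = []
    for branch in branches:
        if branch in (trunk, "origin", "origin/HEAD"):
            continue
        # Timestamps of commits on the branch but not on trunk, oldest first.
        stamps = git("log", "--reverse", "--format=%ct",
                     f"{trunk}..{branch}").splitlines()
        if stamps:  # branches with no unmerged commits are already integrated
            report.append((branch, (now - int(stamps[0])) / 86400))
    return sorted(report, key=lambda item: -item[1])

if __name__ == "__main__":
    for branch, days in branch_ages():
        print(f"{branch}: {days:.1f} days of unmerged work")
```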
Set a target: no branch older than one day. This will feel aggressive. That is the point.
Step 2: Set a branch lifetime limit and make it visible (Week 2)
Agree as a team on a maximum branch lifetime. Start with two days if one day feels too aggressive.
The important thing is to pick a number and enforce it.
Make the limit visible:
- Add a dashboard or report that shows branch age for every open branch.
- Flag any branch that exceeds the limit in the daily standup.
- If your CI tool supports it, add an automated check that warns when a branch exceeds the agreed limit.
The limit creates a forcing function. Developers must either integrate quickly or break their work
into smaller pieces. Both outcomes are desirable.
Step 3: Break large features into small, integrable changes (Weeks 2-3)
The most common objection is “my feature is too big to merge in a day.” This is true when the
feature is designed as a monolithic unit. The fix is decomposition:
- Branch by abstraction. Introduce a new code path alongside the old one. Merge the new code
path in small increments. Switch over when ready.
- Feature flags. Hide incomplete work behind a toggle so it can be merged to trunk without
being visible to users.
- Keystone interface pattern. Build all the back-end work first, merge it incrementally, and
add the UI entry point last. The feature is invisible until the keystone is placed.
- Vertical slices. Deliver the feature as a series of thin, user-visible increments instead of
building all layers at once.
Each technique lets developers merge daily without exposing incomplete functionality. The feature
grows incrementally on trunk rather than in isolation on a branch.
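As an illustration of the feature-flag technique, here is a minimal Python sketch. The flag name, the environment-variable source, and the checkout_v1/checkout_v2 functions are hypothetical stand-ins for your own configuration system and code paths.

```python
# Minimal feature-flag sketch: trunk carries both paths, the flag decides
# which one users see. FEATURE_NEW_CHECKOUT is a hypothetical flag name.
import os

def new_checkout_enabled() -> bool:
    # Default to "off" so the incomplete path stays invisible in production.
    return os.getenv("FEATURE_NEW_CHECKOUT", "false").lower() == "true"

def checkout_v1(cart: list[str]) -> str:
    return f"old checkout: {len(cart)} items"

def checkout_v2(cart: list[str]) -> str:
    return f"new checkout: {len(cart)} items"  # incomplete work grows here, merged daily

def checkout(cart: list[str]) -> str:
    return checkout_v2(cart) if new_checkout_enabled() else checkout_v1(cart)
```

The new path is merged to trunk every day but stays dark until the flag is turned on, which is what makes daily integration safe.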
Step 4: Adopt short-lived branches with daily integration (Weeks 3-4)
Change the team’s workflow:
- Create a branch from trunk.
- Make a small, focused change.
- Get a quick review (the change is small, so review takes minutes).
- Merge to trunk. Delete the branch.
- Repeat.
Each branch lives for hours, not days. If a branch cannot be merged by end of day, it is too
large. The developer should either merge what they have (using one of the decomposition techniques
above) or discard the branch and start smaller tomorrow.
Pair this with the team’s code review practice. Small changes enable fast reviews, and fast reviews
enable short-lived branches. The two practices reinforce each other.
Step 5: Address the objections (Weeks 3-4)
| Objection | Response |
| --- | --- |
| “My feature takes three weeks - I can’t merge in a day” | The feature takes three weeks. The branch does not have to. Use branch by abstraction, feature flags, or vertical slicing to merge daily while the feature grows incrementally on trunk. |
| “Merging incomplete code to trunk is dangerous” | Incomplete code behind a feature flag or without a UI entry point is not dangerous - it is invisible. The danger is a three-week branch that lands as a single untested merge. |
| “I need my branch to keep my work separate from other changes” | That separation is the problem. You want to discover conflicts early, when they are small and cheap to fix. A branch that hides conflicts for three weeks is not protecting you - it is accumulating risk. |
| “We tried short-lived branches and it was chaos” | Short-lived branches require supporting practices: feature flags, good decomposition, fast CI, and a culture of small changes. Without those supports, it will feel chaotic. The fix is to build the supports, not to retreat to long-lived branches. |
| “Code review takes too long for daily merges” | Small changes take minutes to review, not hours. If reviews are slow, that is a review process problem, not a branching problem. See PR Review Bottlenecks. |
Step 6: Continuously tighten the limit (Week 5+)
Once the team is comfortable with two-day branches, reduce the limit to one day. Then push toward
integrating multiple times per day. Each reduction surfaces new problems - features that are hard
to decompose, tests that are slow, reviews that are bottlenecked - and each problem is worth
solving because it blocks the flow of work.
The goal is continuous integration: every developer integrates to trunk at least once per day.
At that point, “branches” are just short-lived workspaces that exist for hours, and merging is
a non-event.
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Average branch lifetime | Should decrease to under one day |
| Maximum branch lifetime | No branch should exceed two days |
| Integration frequency | Should increase toward at least daily per developer |
| Merge conflict frequency | Should decrease as branches get shorter |
| Merge duration | Should decrease from hours to minutes |
| Development cycle time | Should decrease as integration overhead drops |
| Lines changed per merge | Should decrease as changes get smaller |
2.2 - No Continuous Integration
The build has been red for weeks and nobody cares. “CI” means a build server exists, not that anyone actually integrates continuously.
Category: Branching & Integration | Quality Impact: Critical
What This Looks Like
The team has a build server. It runs after every push. There is a dashboard somewhere that shows
build status. But the build has been red for three weeks and nobody has mentioned it. Developers
push code, glance at the result if they remember, and move on. When someone finally investigates,
the failure is in a test that broke weeks ago and nobody can remember which commit caused it.
The word “continuous” has lost its meaning. Developers do not integrate their work into trunk
daily - they work on branches for days or weeks and merge when the feature feels done. The build
server runs, but nobody treats a red build as something that must be fixed immediately. There is no
shared agreement that trunk should always be green. “CI” is a tool in the infrastructure, not a
practice the team follows.
Common variations:
- The build server with no standards. A CI server runs on every push, but there are no rules
about what happens when it fails. Some developers fix their failures. Others do not. The build
flickers between green and red all day, and nobody trusts the signal.
- The nightly build. The build runs once per day, overnight. Developers find out the next
morning whether yesterday’s work broke something. By then they have moved on to new work and
lost context on what they changed.
- The “CI” that is just compilation. The build server compiles the code and nothing else. No
tests run. No static analysis. The build is green as long as the code compiles, which tells the
team almost nothing about whether the software works.
- The manually triggered build. The build server exists, but it does not run on push. After
pushing code, the developer must log into the CI server and manually start the build and tests.
When developers are busy or forget, their changes sit untested. When multiple pushes happen
between triggers, a failure could belong to any of them. The feedback loop depends entirely on
developer discipline rather than automation.
- The branch-only build. CI runs on feature branches but not on trunk. Each branch builds in
isolation, but nobody knows whether the branches work together until merge day. Trunk is not
continuously validated.
- The ignored dashboard. The CI dashboard exists but is not displayed anywhere the team can
see it. Nobody checks it unless they are personally waiting for a result. Failures accumulate
silently.
The telltale sign: if you can ask “how long has the build been red?” and nobody knows the answer,
continuous integration is not happening.
Why This Is a Problem
Continuous integration is not a tool - it is a practice. The practice requires that every developer
integrates to a shared trunk at least once per day and that the team treats a broken build as the
highest-priority problem. Without the practice, the build server is just infrastructure generating
notifications that nobody reads.
It reduces quality
When the build is allowed to stay red, the team loses its only automated signal that something is
wrong. A passing build is supposed to mean “the software works as tested.” A failing build is
supposed to mean “stop and fix this before doing anything else.” When failures are ignored, that
signal becomes meaningless. Developers learn that a red build is background noise, not an alarm.
Once the build signal is untrusted, defects accumulate. A developer introduces a bug on Monday. The
build fails, but it was already red from an unrelated failure, so nobody notices. Another developer
introduces a different bug on Tuesday. By Friday, trunk has multiple interacting defects and nobody
knows when they were introduced or by whom. Debugging becomes archaeology.
When the team practices continuous integration, a red build is rare and immediately actionable. The
developer who broke it knows exactly which change caused the failure because they committed minutes
ago. The fix is fast because the context is fresh. Defects are caught individually, not in tangled
clusters.
It increases rework
Without continuous integration, developers work in isolation for days or weeks. Each developer
assumes their code works because it passes on their machine or their branch. But they are building
on assumptions about shared code that may already be outdated. When they finally integrate, they
discover that someone else changed an API they depend on, renamed a class they import, or modified
behavior they rely on.
The rework cascade is predictable. Developer A changes a shared interface on Monday. Developer B
builds three days of work on the old interface. On Thursday, developer B tries to integrate and
discovers the conflict. Now they must rewrite three days of code to match the new interface. If
they had integrated on Monday, the conflict would have been a five-minute fix.
Teams that integrate continuously discover conflicts within hours, not days. The rework is measured
in minutes because the conflicting changes are small and the developers still have full context on
both sides. The total cost of integration stays low and constant instead of spiking unpredictably.
It makes delivery timelines unpredictable
A team without continuous integration cannot answer the question “is the software releasable right
now?” Trunk may or may not compile. Tests may or may not pass. The last successful build may have
been a week ago. Between then and now, dozens of changes have landed without anyone verifying that
they work together.
This creates a stabilization period before every release. The team stops feature work, fixes the
build, runs the test suite, and triages failures. This stabilization takes an unpredictable amount
of time - sometimes a day, sometimes a week - because nobody knows how many problems have
accumulated since the last known-good state.
With continuous integration, trunk is always in a known state. If the build is green, the team can
release. If the build is red, the team knows exactly which commit broke it and how long ago. There
is no stabilization period because the code is continuously stabilized. Release readiness is a
fact that can be checked at any moment, not a state that must be achieved through a dedicated
effort.
It masks the true cost of integration problems
When the build is permanently broken or rarely checked, the team cannot see the patterns that would
tell them where their process is failing. Is the build slow? Nobody notices because nobody waits
for it. Are certain tests flaky? Nobody notices because failures are expected. Do certain parts of
the codebase cause more breakage than others? Nobody notices because nobody correlates failures to
changes.
These hidden problems compound. The build gets slower because nobody is motivated to speed it up.
Flaky tests multiply because nobody quarantines them. Brittle areas of the codebase stay brittle
because the feedback that would highlight them is lost in the noise.
When the team practices CI and treats a red build as an emergency, every friction point becomes
visible. A slow build annoys the whole team daily, creating pressure to optimize it. A flaky test
blocks everyone, creating pressure to fix or remove it. The practice surfaces the problems. Without
the practice, the problems are invisible and grow unchecked.
Impact on continuous delivery
Continuous integration is the foundation that every other CD practice is built on. Without it, the
pipeline cannot give fast, reliable feedback on every change. Automated testing is pointless if
nobody acts on the results. Deployment automation is pointless if the artifact being deployed has
not been validated. Small batches are pointless if the batches are never verified to work together.
A team that does not practice CI cannot practice CD. The two are not independent capabilities that
can be adopted in any order. CI is the prerequisite. Every hour that the build stays red is an
hour during which the team has no automated confidence that the software works. Continuous delivery
requires that confidence to exist at all times.
How to Fix It
Step 1: Fix the build and agree it stays green (Week 1)
Before anything else, get trunk to green. This is the team’s first and most important commitment.
- Assign the broken build as the highest-priority work item. Stop feature work if necessary.
- Triage every failure: fix it, quarantine it to a non-blocking suite, or delete the test if it
provides no value.
- Once the build is green, make the team agreement explicit: a red build is the team’s top
priority. Whoever broke it fixes it. If they cannot fix it within 15 minutes, they revert
their change and try again with a smaller commit.
Write this agreement down. Put it in the team’s working agreements document. If you do not have
one, start one now. The agreement is simple: we do not commit on top of a red build, and we do not
leave a red build for someone else to fix.
Step 2: Make the build visible (Week 1)
The build status must be impossible to ignore:
- Display the build dashboard on a large monitor visible to the whole team.
- Configure notifications so that a broken build alerts the team immediately - in the team chat
channel, not in individual email inboxes.
- If the build breaks, the notification should identify the commit and the committer.
Visibility creates accountability. When the whole team can see that the build broke at 2:15 PM
and who broke it, social pressure keeps people attentive. When failures are buried in email
notifications, they are easily ignored.
Step 3: Require integration at least once per day (Week 2)
The “continuous” in continuous integration means at least daily, and ideally multiple times per day.
Set the expectation:
- Every developer integrates their work to trunk at least once per day.
- If a developer has been working on a branch for more than a day without integrating, that is a
problem to discuss at standup.
- Track integration frequency per developer per day. Make it visible alongside the build dashboard.
This will expose problems. Some developers will say their work is not ready to integrate. That is a
decomposition problem - the work is too large. Some will say they cannot integrate because the build
is too slow. That is a pipeline problem. Each problem is worth solving. See
Long-Lived Feature Branches for techniques to break large work
into daily integrations.
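One way to get the integration-frequency number is to read it from trunk’s history. A minimal Python sketch follows, assuming the git CLI is available and origin/main is trunk; if your team integrates via merge commits, count those instead.

```python
# Count commits that landed on trunk in the last day, grouped by author.
import subprocess
from collections import Counter

def integrations_last_day(trunk: str = "origin/main") -> Counter:
    out = subprocess.run(
        ["git", "log", trunk, "--since=1.day", "--no-merges", "--format=%an"],
        capture_output=True, text=True, check=True,
    ).stdout
    return Counter(out.splitlines())

if __name__ == "__main__":
    for author, count in integrations_last_day().most_common():
        print(f"{author}: {count} integration(s) in the last 24 hours")
```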
Step 4: Make the build fast enough to provide useful feedback (Weeks 2-3)
A build that takes 45 minutes is a build that developers will not wait for. Target under 10
minutes for the primary feedback loop:
- Identify the slowest stages and optimize or parallelize them.
- Move slow integration tests to a secondary pipeline that runs after the fast suite passes.
- Add build caching so that unchanged dependencies are not recompiled on every run.
- Run tests in parallel if they are not already.
The goal is a fast feedback loop: the developer pushes, waits a few minutes, and knows whether
their change works with everything else. If they have to wait 30 minutes, they will context-switch,
and the feedback loop breaks.
Step 5: Address the objections (Weeks 3-4)
| Objection | Response |
| --- | --- |
| “The build is too slow to fix every red immediately” | Then the build is too slow, and that is a separate problem to solve. A slow build is not a reason to ignore failures - it is a reason to invest in making the build faster. |
| “Some tests are flaky - we can’t treat every failure as real” | Quarantine flaky tests into a non-blocking suite. The blocking suite must be deterministic. If a test in the blocking suite fails, it is real until proven otherwise. |
| “We can’t integrate daily - our features take weeks” | The features take weeks. The integrations do not have to. Use branch by abstraction, feature flags, or vertical slicing to integrate partial work daily. |
| “Fixing someone else’s broken build is not my job” | It is the whole team’s job. A red build blocks everyone. If the person who broke it is unavailable, someone else should revert or fix it. The team owns the build, not the individual. |
| “We have CI - the build server runs on every push” | A build server is not CI. CI is the practice of integrating frequently and keeping the build green. If the build has been red for a week, you have a build server, not continuous integration. |
Step 6: Build the habit (Week 4+)
Continuous integration is a daily discipline, not a one-time setup. Reinforce the habit:
- Review integration frequency in retrospectives. If it is dropping, ask why.
- Celebrate streaks of consecutive green builds. Make it a point of team pride.
- When a developer reverts a broken commit quickly, recognize it as the right behavior - not as a
failure.
- Periodically audit the build: is it still fast? Are new flaky tests creeping in? Is the test
coverage meaningful?
The goal is a team culture where a red build feels wrong - like an alarm that demands immediate
attention. When that instinct is in place, CI is no longer a process being followed. It is how
the team works.
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Build pass rate | Percentage of builds that pass on first run - should be above 95% |
| Time to fix a broken build | Should be under 15 minutes, with revert as the fallback |
| Integration frequency | At least one integration per developer per day |
| Build duration | Should be under 10 minutes for the primary feedback loop |
| Longest period with a red build | Should be measured in minutes, not hours or days |
| Development cycle time | Should decrease as integration overhead drops and stabilization periods disappear |
3 - Testing
Anti-patterns in test strategy, test architecture, and quality practices that block continuous delivery.
These anti-patterns affect how teams build confidence that their code is safe to deploy. They
create slow pipelines, flaky feedback, and manual gates that prevent the continuous flow of
changes to production.
3.1 - No Test Automation
Zero automated tests. The team has no idea where to start and the codebase was not designed for testability.
Category: Testing & Quality | Quality Impact: Critical
What This Looks Like
The team deploys by manually verifying things work. Someone clicks through the application, checks
a few screens, and declares it good. There is no test suite. No test runner configured. No test
directory in the repository. The CI server, if one exists, builds the code and stops there.
When a developer asks “how do I know if my change broke something?” the answer is either “you
don’t” or “someone from QA will check it.” Bugs discovered in production are treated as inevitable.
Nobody connects the lack of automated tests to the frequency of production incidents because there
is no baseline to compare against.
Common variations:
- Tests exist but are never run. Someone wrote tests a year ago. The test suite is broken and
nobody has fixed it. The tests are checked into the repository but are not part of any pipeline
or workflow.
- Manual test scripts as the safety net. A spreadsheet or wiki page lists hundreds of manual
test cases. Before each release, someone walks through them by hand. The process takes days. It
is the only verification the team has.
- Testing is someone else’s job. Developers write code. A separate QA team tests it days or
weeks later. The feedback loop is so long that developers have moved on to other work by the
time defects are found.
- “The code is too legacy to test.” The team has decided the codebase is untestable.
Functions are thousands of lines long, everything depends on global state, and there are no
seams where test doubles could be inserted. This belief becomes self-fulfilling - nobody tries
because everyone agrees it is impossible.
The telltale sign: when a developer makes a change, the only way to verify it works is to deploy
it and see what happens.
Why This Is a Problem
Without automated tests, every change is a leap of faith. The team has no fast, reliable way to
know whether code works before it reaches users. Every downstream practice that depends on
confidence in the code - continuous integration, automated deployment, frequent releases - is
blocked.
It reduces quality
When there are no automated tests, defects are caught by humans or by users. Humans are slow,
inconsistent, and unable to check everything. A manual tester cannot verify 500 behaviors in an
hour, but an automated suite can. The behaviors that are not checked are the ones that break.
Developers writing code without tests have no feedback on whether their logic is correct until
someone else exercises it. A function that handles an edge case incorrectly will not be caught
until a user hits that edge case in production. By then, the developer has moved on and lost
context on the code they wrote.
With even a basic suite of automated tests, developers get feedback in minutes. They catch their
own mistakes while the code is fresh. The suite runs the same checks every time, never forgetting
an edge case and never getting tired.
It increases rework
Without tests, rework comes from two directions. First, bugs that reach production must be
investigated, diagnosed, and fixed - work that an automated test would have prevented. Second,
developers are afraid to change existing code because they have no way to verify they have not
broken something. This fear leads to workarounds: copy-pasting code instead of refactoring,
adding conditional branches instead of restructuring, and building new modules alongside old ones
instead of modifying what exists.
Over time, the codebase becomes a patchwork of workarounds layered on workarounds. Each change
takes longer because the code is harder to understand and more fragile. The absence of tests is
not just a testing problem - it is a design problem that compounds with every change.
Teams with automated tests refactor confidently. They rename functions, extract modules, and
simplify logic knowing that the test suite will catch regressions. The codebase stays clean
because changing it is safe.
It makes delivery timelines unpredictable
Without automated tests, the time between “code complete” and “deployed” is dominated by manual
verification. How long that verification takes depends on how many changes are in the batch, how
available the testers are, and how many defects they find. None of these variables are predictable.
A change that a developer finishes on Monday might not be verified until Thursday. If defects are
found, the cycle restarts. Lead time from commit to production is measured in weeks, and the
variance is enormous. Some changes take three days, others take three weeks, and the team cannot
predict which.
Automated tests collapse the verification step to minutes. The time from “code complete” to
“verified” becomes a constant, not a variable. Lead time becomes predictable because the largest
source of variance has been removed.
Impact on continuous delivery
Automated tests are the foundation of continuous delivery. Without them, there is no automated
quality gate. Without an automated quality gate, there is no safe way to deploy frequently.
Without frequent deployment, there is no fast feedback from production. Every CD practice assumes
that the team can verify code quality automatically. A team with no test automation is not on a
slow path to CD - they have not started.
How to Fix It
Starting test automation on an untested codebase feels overwhelming. The key is to start small,
establish the habit, and expand coverage incrementally. You do not need to test everything before
you get value - you need to test something and keep going.
Step 1: Set up the test infrastructure (Week 1)
Before writing a single test, make it trivially easy to run tests:
- Choose a test framework for your primary language. Pick the most popular one - do not deliberate.
- Add the framework to the project. Configure it. Write a single test that asserts true == true and verify it passes (a minimal example follows below).
- Add a test script or command to the project so that anyone can run the suite with a single command (e.g., npm test, pytest, mvn test).
- Add the test command to the CI pipeline so that tests run on every push.
The goal for week one is not coverage. It is infrastructure: a working test runner in the pipeline
that the team can build on.
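For a pytest-based stack (an assumption - use whatever framework fits your language), the week-one infrastructure test can be this small:

```python
# test_smoke.py - a single trivial test that proves the test runner and
# the pipeline wiring work. Replace it with real tests as the habit forms.

def test_the_wiring_works():
    assert True
```

Running pytest locally and in the pipeline should both report one passing test; once that is true, the infrastructure goal for week one is met.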
Step 2: Write tests for every new change (Week 2+)
Establish a team rule: every new change must include at least one automated test. Not “every new
feature” - every change. Bug fixes get a regression test that fails without the fix and passes
with it. New functions get a test that verifies the core behavior. Refactoring gets a test that
pins the existing behavior before changing it.
This rule is more important than retroactive coverage. New code enters the codebase tested. The
tested portion grows with every commit. After a few months, the most actively changed code has
coverage, which is exactly where coverage matters most.
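A hedged sketch of the rule in practice, with hypothetical names: suppose discount_total once crashed on an empty cart. The fix ships together with a regression test that fails without the guard and passes with it.

```python
def discount_total(prices: list[float], discount: float) -> float:
    if not prices:  # the fix: guard the empty-cart case that caused the bug
        return 0.0
    return sum(prices) * (1 - discount)

def test_discount_total_handles_empty_cart():
    # Regression test: fails on the pre-fix code, passes once the guard exists.
    assert discount_total([], 0.10) == 0.0
```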
Step 3: Target high-change areas for retroactive coverage (Weeks 3-6)
Use your version control history to find the files that change most often. These are the files
where bugs are most likely and where tests provide the most value:
- List the 10 files with the most commits in the last six months.
- For each file, write tests for its core public behavior. Do not try to test every line - test
the functions that other code depends on.
- If the code is hard to test because of tight coupling, wrap it. Create a thin adapter around
the untestable code and test the adapter. This is the Strangler Fig pattern applied to testing.
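A sketch of the wrapping approach, with hypothetical names: the legacy call is injected into a thin adapter, so the adapter can be tested with a fake while the legacy module itself stays untouched.

```python
class LegacyBillingAdapter:
    """Thin, testable seam in front of an untestable legacy module."""

    def __init__(self, legacy_charge):
        # The legacy function is injected so tests can substitute a fake.
        self._charge = legacy_charge

    def charge_in_cents(self, amount_dollars: float) -> int:
        # New code talks to the adapter; the adapter normalizes inputs
        # before delegating to the legacy call.
        return self._charge(round(amount_dollars * 100))

def test_adapter_converts_dollars_to_cents():
    calls: list[int] = []
    adapter = LegacyBillingAdapter(lambda cents: calls.append(cents) or cents)
    assert adapter.charge_in_cents(12.34) == 1234
    assert calls == [1234]
```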
Step 4: Make untestable code testable incrementally (Weeks 4-8)
If the codebase resists testing, introduce seams one at a time:
| Problem | Technique |
| --- | --- |
| Function does too many things | Extract the pure logic into a separate function and test that |
| Hard-coded database calls | Introduce a repository interface, inject it, test with a fake |
| Global state or singletons | Pass dependencies as parameters instead of accessing globals |
| No dependency injection | Start with “poor man’s DI” - default parameters that can be overridden in tests |
You do not need to refactor the entire codebase. Each time you touch a file, leave it slightly
more testable than you found it.
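As one example, the “poor man’s DI” row above can look like this (names are hypothetical): the production default is kept, and a test overrides the dependency through a parameter.

```python
import time

def create_order(items: list[str], clock=time.time) -> dict:
    # clock defaults to the real time source; tests inject a fake one.
    return {"items": items, "created_at": clock()}

def test_create_order_stamps_creation_time():
    order = create_order(["book"], clock=lambda: 1_700_000_000.0)
    assert order["created_at"] == 1_700_000_000.0
```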
Step 5: Set a coverage floor and ratchet it up (Week 6+)
Once you have meaningful coverage in actively changed code, set a coverage threshold in the
pipeline:
- Measure current coverage. Say it is 15%.
- Set the pipeline to fail if coverage drops below 15%.
- Every two weeks, raise the floor by 2-5 percentage points.
The floor prevents backsliding. The ratchet ensures progress. The team does not need to hit 90%
coverage - they need to ensure that coverage only goes up.
| Objection | Response |
| --- | --- |
| “The codebase is too legacy to test” | You do not need to test the legacy code directly. Wrap it in testable adapters and test those. Every new change gets a test. Coverage grows from the edges inward. |
| “We don’t have time to write tests” | You are already spending that time on manual verification and production debugging. Tests shift that cost to the left where it is cheaper. Start with one test per change - the overhead is minutes, not hours. |
| “We need to test everything before it’s useful” | One test that catches one regression is more useful than zero tests. The value is immediate and cumulative. You do not need full coverage to start getting value. |
| “Developers don’t know how to write tests” | Pair a developer who has testing experience with one who does not. If nobody on the team has experience, invest one day in a testing workshop. The skill is learnable in a week. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Test count | Should increase every sprint |
| Code coverage of actively changed files | More meaningful than overall coverage - focus on files changed in the last 30 days |
| Build duration | Should increase slightly as tests are added, but stay under 10 minutes |
| Defects found in production vs. in tests | Ratio should shift toward tests over time |
| Change fail rate | Should decrease as test coverage catches regressions before deployment |
| Manual testing effort per release | Should decrease as automated tests replace manual verification |
3.2 - Manual Regression Testing Gates
Every release requires days or weeks of manual testing. Testers execute scripted test cases. Test effort scales linearly with application size.
Category: Testing & Quality | Quality Impact: Critical
What This Looks Like
Before every release, the team enters a testing phase. Testers open a spreadsheet or test
management tool containing hundreds of scripted test cases. They walk through each one by hand:
click this button, enter this value, verify this result. The testing takes days. Sometimes it takes
weeks. Nothing ships until every case is marked pass or fail, and every failure is triaged.
Developers stop working on new features during this phase because testers need a stable build to
test against. Code freezes go into effect. Bug fixes discovered during testing must be applied
carefully to avoid invalidating tests that have already passed. The team enters a holding pattern
where the only work that matters is getting through the test cases.
The testing effort grows with every release. New features add new test cases, but old test cases
are rarely removed because nobody is confident they are redundant. A team that tested for three
days six months ago now tests for five. The spreadsheet has 800 rows. Every release takes longer
to validate than the last.
Common variations:
- The regression spreadsheet. A master spreadsheet of every test case the team has ever
written. Before each release, a tester works through every row. The spreadsheet is the
institutional memory of what the software is supposed to do, and nobody trusts anything else.
- The dedicated test phase. The sprint cadence is two weeks of development followed by one week
of testing. The test week is a mini-waterfall phase embedded in an otherwise agile process.
Nothing can ship until the test phase is complete.
- The test environment bottleneck. Manual testing requires a specific environment that is shared
across teams. The team must wait for their slot. When the environment is broken by another team’s
testing, everyone waits for it to be restored.
- The sign-off ceremony. A QA lead or manager must personally verify a subset of critical paths
and sign a document before the release can proceed. If that person is on vacation, the release
waits.
- The compliance-driven test cycle. Regulatory requirements are interpreted as requiring manual
execution of every test case with documented evidence. Each test run produces screenshots and
sign-off forms. The documentation takes as long as the testing itself.
The telltale sign: if the question “can we release today?” is always answered with “not until QA
finishes,” manual regression testing is gating your delivery.
Why This Is a Problem
Manual regression testing feels responsible. It feels thorough. But it creates a bottleneck that
grows worse with every feature the team builds, and the thoroughness it promises is an illusion.
It reduces quality
Manual testing is less reliable than it appears. A human executing the same test case for the
hundredth time will miss things. Attention drifts. Steps get skipped. Edge cases that seemed
important when the test was written get glossed over when the tester is on row 600 of a
spreadsheet. Studies on manual testing consistently show that testers miss 15-30% of defects
that are present in the software they are testing.
The test cases themselves decay. They were written for the version of the software that existed
when the feature shipped. As the product evolves, some cases become irrelevant, others become
incomplete, and nobody updates them systematically. The team is executing a test plan that
partially describes software that no longer exists.
The feedback delay compounds the quality problem. A developer who wrote code two weeks ago gets
a bug report from a tester during the regression cycle. The developer has lost context on the
change. They re-read their own code, try to remember what they were thinking, and fix the bug
with less confidence than they would have had the day they wrote it.
Automated tests catch the same classes of bugs in seconds, with perfect consistency, every time
the code changes. They do not get tired on row 600. They do not skip steps. They run against the
current version of the software, not a test plan written six months ago. And they give feedback
immediately, while the developer still has full context.
It increases rework
The manual testing gate creates a batch-and-queue cycle. Developers write code for two weeks, then
testers spend a week finding bugs in that code. Every bug found during the regression cycle is
rework: the developer must stop what they are doing, reload the context of a completed story,
diagnose the issue, fix it, and send it back to the tester for re-verification. The re-verification
may invalidate other test cases, requiring additional re-testing.
The batch size amplifies the rework. When two weeks of changes are tested together, a bug could be
in any of dozens of commits. Narrowing down the cause takes longer because there are more
variables. When the same bug would have been caught by an automated test minutes after it was
introduced, the developer would have fixed it in the same sitting - one context switch instead of
many.
The rework also affects testers. A bug fix during the regression cycle means the tester must re-run
affected test cases. If the fix changes behavior elsewhere, the tester must re-run those cases too.
A single bug fix can cascade into hours of re-testing, pushing the release date further out.
With automated regression tests, bugs are caught as they are introduced. The fix happens
immediately. There is no regression cycle, no re-testing cascade, and no context-switching penalty.
It makes delivery timelines unpredictable
The regression testing phase takes as long as it takes. The team cannot predict how many bugs the
testers will find, how long each fix will take, or how much re-testing the fixes will require. A
release planned for Friday might slip to the following Wednesday. Or the following Friday.
This unpredictability cascades through the organization. Product managers cannot commit to delivery
dates because they do not know how long testing will take. Stakeholders learn to pad their
expectations. “We’ll release in two weeks” really means “we’ll release in two to four weeks,
depending on what QA finds.”
The unpredictability also creates pressure to cut corners. When the release is already three days
late, the team faces a choice: re-test thoroughly after a late bug fix, or ship without full
re-testing. Under deadline pressure, most teams choose the latter. The manual testing gate that
was supposed to ensure quality becomes the reason quality is compromised.
Automated regression suites produce predictable, repeatable results. The suite runs in the same
amount of time every time. There is no testing phase to slip. The team knows within minutes of
every commit whether the software is releasable.
It creates a permanent scaling problem
Manual testing effort scales linearly with application size. Every new feature adds test cases.
The test suite never shrinks. A team that takes three days to test today will take four days in
six months and five days in a year. The testing phase consumes an ever-growing fraction of the
team’s capacity.
This scaling problem is invisible at first. Three days of testing feels manageable. But the growth
is relentless. The team that started with 200 test cases now has 800. The test phase that was two
days is now a week. And because the test cases were written by different people at different times,
nobody can confidently remove any of them without risking a missed regression.
Automated tests scale differently. Adding a new automated test adds milliseconds to the suite duration, not hours to the testing phase. With parallel execution, a team with 10,000 automated tests can run them in roughly the same 10 minutes as a team with 1,000. The cost of confidence stays nearly fixed instead of growing linearly.
Impact on continuous delivery
Manual regression testing is fundamentally incompatible with continuous delivery. CD requires that
any commit can be released at any time. A manual testing gate that takes days means the team can
release at most once per testing cycle. If the gate takes a week, the team releases at most every
two or three weeks - regardless of how fast their pipeline is or how small their changes are.
The manual gate also breaks the feedback loop that CD depends on. CD gives developers confidence
that their change works by running automated checks within minutes. A manual gate replaces that
fast feedback with a slow, batched, human process that cannot keep up with the pace of development.
You cannot have continuous delivery with a manual regression gate. The two are mutually exclusive.
The gate must be automated before CD is possible.
How to Fix It
Step 1: Catalog your manual test cases and categorize them (Week 1)
Before automating anything, understand what the manual test suite actually covers. For every test
case in the regression suite:
- Identify what behavior it verifies.
- Classify it: is it testing business logic, a UI flow, an integration boundary, or a compliance
requirement?
- Rate its value: has this test ever caught a real bug? When was the last time?
- Rate its automation potential: can this be tested at a lower level (unit, functional, API)?
Most teams discover that a large percentage of their manual test cases are either redundant (the
same behavior is tested multiple times), outdated (the feature has changed), or automatable at a
lower level.
Step 2: Automate the highest-value cases first (Weeks 2-4)
Pick the 20 test cases that cover the most critical paths - the ones that would cause the most
damage if they regressed. Automate them:
- Business logic tests become unit tests.
- API behavior tests become functional tests.
- Critical user journeys become a small set of E2E smoke tests.
Do not try to automate everything at once. Start with the cases that give the most confidence per
minute of execution time. The goal is to build a fast automated suite that covers the riskiest
scenarios so the team no longer depends on manual execution for those paths.
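For example, a manual case like “open the order page and confirm the status is shown” often converts to an API-level functional test. A hedged sketch, assuming the requests library and a stable, seeded test environment; the URL, endpoint, and fields are hypothetical.

```python
import requests

BASE_URL = "https://staging.example.com"  # assumption: a stable, seeded test environment

def test_order_status_endpoint_returns_known_order():
    # Verifies the same behavior the manual case checked, in seconds instead of minutes.
    response = requests.get(f"{BASE_URL}/api/orders/12345", timeout=10)
    assert response.status_code == 200
    assert response.json()["status"] in {"pending", "shipped", "delivered"}
```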
Step 3: Run automated tests in the pipeline on every commit (Week 3)
Move the new automated tests into the CI pipeline so they run on every push. This is the critical
shift: testing moves from a phase at the end of development to a continuous activity that happens
with every change.
Every commit now gets immediate feedback on the critical paths. If a regression is introduced, the
developer knows within minutes - not weeks.
Step 4: Shrink the manual suite as automation grows (Weeks 4-8)
Each week, pick another batch of manual test cases and either automate or retire them:
- Automate cases where the behavior is stable and testable at a lower level.
- Retire cases that are redundant with existing automated tests or that test behavior that no
longer exists.
- Keep manual only for genuinely exploratory testing that requires human judgment - usability
evaluation, visual design review, or complex workflows that resist automation.
Track the shrinkage. If the manual suite had 800 cases and now has 400, that is progress. If the
manual testing phase took five days and now takes two, that is measurable improvement.
Step 5: Replace the testing phase with continuous testing (Weeks 6-8+)
The goal is to eliminate the dedicated testing phase entirely:
| Before | After |
| --- | --- |
| Code freeze before testing | No code freeze - trunk is always testable |
| Testers execute scripted cases | Automated suite runs on every commit |
| Bugs found days or weeks after coding | Bugs found minutes after coding |
| Testing phase blocks release | Release readiness checked automatically |
| QA sign-off required | Pipeline pass is the sign-off |
| Testers do manual regression | Testers do exploratory testing, write automated tests, and improve test infrastructure |
Step 6: Address the objections (Ongoing)
| Objection | Response |
| --- | --- |
| “Automated tests can’t catch everything a human can” | Correct. But humans cannot execute 800 test cases reliably in a day, and automated tests can. Automate the repeatable checks and free humans for the exploratory testing where their judgment adds value. |
| “We need manual testing for compliance” | Most compliance frameworks require evidence that testing was performed, not that humans performed it. Automated test reports with pass/fail results, timestamps, and traceability to requirements satisfy most audit requirements better than manual spreadsheets. Confirm with your compliance team. |
| “Our testers don’t know how to write automated tests” | Pair testers with developers. The tester contributes domain knowledge - what to test and why - while the developer contributes automation skills. Over time, the tester learns automation and the developer learns testing strategy. |
| “We can’t automate tests for our legacy system” | Start with new code. Every new feature gets automated tests. For legacy code, automate the most critical paths first and expand coverage as you touch each area. The legacy system does not need 100% automation overnight. |
| “What if we automate a test wrong and miss a real bug?” | Manual tests miss real bugs too - consistently. An automated test that is wrong can be fixed once and stays fixed. A manual tester who skips a step makes the same mistake next time. Automation is not perfect, but it is more reliable and more improvable than manual execution. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Manual test case count | Should decrease steadily as cases are automated or retired |
| Manual testing phase duration | Should shrink toward zero |
| Automated test count in pipeline | Should grow as manual cases are converted |
| Release frequency | Should increase as the manual gate shrinks |
| Development cycle time | Should decrease as the testing phase is eliminated |
| Time from code complete to release | Should converge toward pipeline duration, not testing phase duration |
3.3 - Flaky Test Suites
Tests randomly pass or fail. Developers rerun the pipeline until it goes green. Nobody trusts the test suite to tell them anything useful.
Category: Testing & Quality | Quality Impact: High
What This Looks Like
A developer pushes a change. The pipeline fails. They look at the failure - it is a test they did
not touch, in a module they did not change. They click “rerun.” It passes. They merge.
This happens multiple times a day across the team. Nobody investigates failures on the first
occurrence because the odds favor flakiness over a real problem. When someone mentions a test
failure in standup, the first question is “did you rerun it?” not “what broke?”
Common variations:
- The nightly lottery. The full suite runs overnight. Every morning, a different random subset
of tests is red. Someone triages the failures, marks most as flaky, and the team moves on. Real
regressions hide in the noise.
- The retry-until-green pattern. The pipeline configuration automatically reruns failed tests two or three times. If a test passes on any attempt, it counts as passed. The team considers this solved. In reality, it masks failures and inflates pipeline duration with every retry.
- The “known flaky” tag. Tests are annotated with a skip or known-flaky marker. The suite
ignores them. The list grows over time. Nobody goes back to fix them because they are out of
sight.
- Environment-dependent failures. Tests pass on developer machines but fail in CI, or pass in
CI but fail on Tuesdays. The failures correlate with shared test environments, time-of-day
load patterns, or external service availability.
- Test order dependency. Tests pass when run in a specific order but fail when run in
isolation or in a different sequence. Shared mutable state from one test leaks into another.
The telltale sign: the team has a shared understanding that the first pipeline failure “doesn’t
count.” Rerunning the pipeline is a routine step, not an exception.
Why This Is a Problem
Flaky tests are not a minor annoyance. They systematically destroy the value of the test suite by
making it impossible to distinguish signal from noise. A test suite that sometimes lies is worse
than no test suite at all, because it creates an illusion of safety.
It reduces quality
When tests fail randomly, developers stop trusting them. The rational response to a flaky suite
is to ignore failures - and that is exactly what happens. A developer whose pipeline fails three
times a week for reasons unrelated to their code learns to click “rerun” without reading the
error message.
This behavior is invisible most of the time. It becomes catastrophic when a real regression
happens. The test that catches the regression fails, the developer reruns because “it’s probably
flaky,” it passes on the second run because the flaky behavior went the other way, and the
regression ships to production. The test did its job, but the developer’s trained behavior
neutralized it.
In a suite with zero flaky tests, every failure demands investigation. Developers read the error,
find the cause, and fix it. Failures are rare and meaningful. The suite functions as a reliable
quality gate.
It increases rework
Flaky tests cause rework in two ways. First, developers spend time investigating failures that
turn out to be noise. A developer sees a test failure, spends 20 minutes reading the error and
reviewing their change, realizes the failure is unrelated, and reruns. Multiply this by every
developer on the team, multiple times per day.
Second, the retry-until-green pattern extends pipeline duration. A pipeline that should take 8 minutes takes 20 because failed tests are rerun automatically. Developers wait longer for feedback and lose more time to context switching.
Teams with deterministic test suites waste zero time investigating flaky failures. Their pipeline
runs once, gives an answer, and the developer acts on it.
It makes delivery timelines unpredictable
A flaky suite introduces randomness into the delivery process. The same code, submitted twice,
might pass the pipeline on the first attempt or take three reruns. Lead time from commit to merge
varies not because of code quality but because of test noise.
When the team needs to ship urgently, flaky tests become a source of anxiety. “Will the pipeline
pass this time?” The team starts planning around the flakiness - running the pipeline early “in
case it fails,” avoiding changes late in the day because there might not be time for reruns. The
delivery process is shaped by the unreliability of the tests rather than by the quality of the
code.
Deterministic tests make delivery time a function of code quality alone. The pipeline is a
predictable step that takes the same amount of time every run. There are no surprises.
It normalizes ignoring failures
The most damaging effect of flaky tests is cultural. Once a team accepts that test failures are
often noise, the standard for investigating failures drops permanently. New team members learn
from day one that “you just rerun it.” The bar for adding a flaky test to the suite is low
because one more flaky test is barely noticeable when there are already dozens.
This normalization extends beyond tests. If the team tolerates unreliable automated checks, they
will tolerate unreliable monitoring, unreliable alerts, and unreliable deploys. Flaky tests teach
the team that automation is not trustworthy - exactly the opposite of what CD requires.
Impact on continuous delivery
Continuous delivery depends on automated quality gates that the team trusts completely. A flaky
suite is a quality gate with a broken lock - it looks like it is there, but it does not actually
stop anything. Developers bypass it by rerunning. Regressions pass through it by luck.
The pipeline must be a machine that answers one question with certainty: “Is this change safe to
deploy?” A flaky suite answers “probably, maybe, rerun and ask again.” That is not a foundation
you can build continuous delivery on.
How to Fix It
Step 1: Measure the flakiness (Week 1)
Before fixing anything, quantify the problem:
- Collect pipeline run data for the last 30 days. Count the number of runs that failed and were
rerun without code changes.
- Identify which specific tests failed across those reruns. Rank them by failure frequency.
- Calculate the pipeline pass rate: what percentage of first-attempt runs succeed?
This gives you a hit list and a baseline. If your first-attempt pass rate is 60%, then 40% of
pipeline runs end in a failure on the first attempt - and the rerun data tells you how much of
that is flaky noise rather than real defects.
Step 2: Quarantine the worst offenders (Week 1)
Take the top 10 flakiest tests and move them out of the pipeline-gating suite immediately. Do not
fix them yet - just remove them from the critical path.
- Move them to a separate test suite that runs on a schedule (nightly or hourly) but does not
block merges.
- Create a tracking issue for each quarantined test with its failure rate and the suspected cause.
This immediately improves pipeline reliability. The team sees fewer false failures on day one.
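If the suite runs on pytest (an assumption - most test frameworks have an equivalent tagging
mechanism), the quarantine mechanics can be a minimal sketch like the one below; the marker name
and pipeline commands are placeholders, not a prescribed convention:

```python
# conftest.py - register a "quarantine" marker so pytest recognizes it.
def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "quarantine: flaky test removed from the pipeline-gating run",
    )


# In the test file: tag the flaky test and link it to its tracking issue.
import pytest

@pytest.mark.quarantine  # tracking issue FLAKY-123 (hypothetical), fails ~8% of runs
def test_checkout_updates_inventory():
    ...


# Gating pipeline job:      pytest -m "not quarantine"
# Scheduled (nightly) job:  pytest -m quarantine
```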
Step 3: Fix or replace quarantined tests (Weeks 2-4)
Work through the quarantined tests systematically. For each one, identify the root cause:
| Root cause | Fix |
| --- | --- |
| Shared mutable state (database, filesystem, cache) | Isolate test data. Each test creates and destroys its own state. Use transactions or test containers. |
| Timing dependencies (sleep, setTimeout, polling) | Replace time-based waits with event-based waits. Wait for a condition, not a duration. |
| Test order dependency | Ensure each test is self-contained. Run tests in random order to surface hidden dependencies. |
| External service dependency | Replace with a test double. Validate the double with a contract test. |
| Race conditions in async code | Use deterministic test patterns. Await promises. Avoid fire-and-forget in test code. |
| Resource contention (ports, files, shared environments) | Allocate unique resources per test. Use random ports. Use temp directories. |
For each quarantined test, either fix it and return it to the gating suite or replace it with a
deterministic lower-level test that covers the same behavior.
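As a concrete illustration of the timing-dependency row above, here is a sketch in Python. The
`OrderProcessor` class is hypothetical; the pattern is what matters - wait on an explicit
completion signal instead of sleeping for a guessed duration:

```python
import threading

# Hypothetical system under test: processes an order on a background thread
# and signals completion through an event instead of leaving callers to guess.
class OrderProcessor:
    def __init__(self):
        self.done = threading.Event()
        self.status = None

    def submit(self, order_id):
        def work():
            self.status = f"processed:{order_id}"
            self.done.set()  # explicit completion signal
        threading.Thread(target=work).start()


# Flaky version (for contrast): sleep for a guessed duration and hope.
#   time.sleep(0.5)  # sometimes too short -> random failure
#   assert processor.status == "processed:A-1"

# Deterministic version: wait for the completion event. The timeout is generous
# and only matters when something is genuinely broken.
def test_order_is_processed():
    processor = OrderProcessor()
    processor.submit("A-1")
    assert processor.done.wait(timeout=5), "order was never processed"
    assert processor.status == "processed:A-1"
```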
Step 4: Prevent new flaky tests from entering the suite (Week 3+)
Establish guardrails so the problem does not recur:
- Run new tests 10 times in CI before merging them. If any run fails, the test is flaky and must
be fixed before it enters the suite.
- Run the full suite in random order. This surfaces order-dependent tests immediately.
- Track the pipeline first-attempt pass rate as a team metric. Make it visible on a dashboard.
Set a target (e.g., 95%) and treat drops below the target as incidents.
- Add a team working agreement: flaky tests are treated as bugs with the same priority as
production defects.
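The "run it 10 times before it merges" guardrail can be a small burn-in script in the CI job that
touches new or changed tests. A sketch, assuming pytest is on the PATH and the test's node ID is
passed as a command-line argument:

```python
"""Burn-in check for new or changed tests: run the test repeatedly and refuse
to let it into the gating suite if any run fails."""
import subprocess
import sys

RUNS = 10

def burn_in(test_id: str) -> int:
    failures = 0
    for attempt in range(1, RUNS + 1):
        result = subprocess.run(["pytest", "-q", test_id])
        if result.returncode != 0:
            failures += 1
            print(f"attempt {attempt}/{RUNS}: FAILED")
    return failures

if __name__ == "__main__":
    failed = burn_in(sys.argv[1])
    if failed:
        print(f"{failed}/{RUNS} runs failed - the test is flaky; fix it before merging.")
        sys.exit(1)
    print(f"{RUNS}/{RUNS} runs passed - the test may enter the gating suite.")
```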
Step 5: Eliminate automatic retries (Week 4+)
If the pipeline is configured to automatically retry failed tests, turn it off. Retries mask
flakiness instead of surfacing it. Once the quarantine and prevention steps are in place, the
suite should be reliable enough to run once.
If a test fails, it should mean something. Retries teach the team that failures are meaningless.
| Objection | Response |
| --- | --- |
| “Retries are fine - they handle transient issues” | Transient issues in a test suite are a symptom of external dependencies or shared state. Fix the root cause instead of papering over it with retries. |
| “We don’t have time to fix flaky tests” | Calculate the time the team spends rerunning pipelines and investigating false failures. It is almost always more than the time to fix the flaky tests. |
| “Some flakiness is inevitable with E2E tests” | That is an argument for fewer E2E tests, not for tolerating flakiness. Push the test down to a level where it can be deterministic. |
| “The flaky test sometimes catches real bugs” | A test that catches real bugs 5% of the time and false-alarms 20% of the time is a net negative. Replace it with a deterministic test that catches the same bugs 100% of the time. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Pipeline first-attempt pass rate | Should climb toward 95%+ |
| Number of quarantined tests | Should decrease to zero as tests are fixed or replaced |
| Pipeline reruns per week | Should drop to near zero |
| Build duration | Should decrease as retries are removed |
| Development cycle time | Should decrease as developers stop waiting for reruns |
| Developer trust survey | Ask quarterly: “Do you trust the test suite to catch real problems?” Answers should improve. |
3.4 - Inverted Test Pyramid
Most tests are slow end-to-end or UI tests. Few unit tests. The test suite is slow, brittle, and expensive to maintain.
Category: Testing & Quality | Quality Impact: High
What This Looks Like
The team has tests, but the wrong kind. Running the full suite takes 30 minutes or more. Tests
fail randomly. Developers rerun the pipeline and hope for green. When a test fails, the first
question is “is that a real failure or a flaky test?” rather than “what did I break?”
Common variations:
- The ice cream cone. Most testing is manual. Below that, a large suite of end-to-end browser
tests. A handful of integration tests. Almost no unit tests. The manual testing takes days, the
E2E suite takes hours, and nothing runs fast enough to give developers feedback while they code.
- The E2E-first approach. The team believes end-to-end tests are “real” tests because they
test the “whole system.” Unit tests are dismissed as “not testing anything useful” because they
use mocks. The result is a suite of 500 Selenium tests that take 45 minutes and fail 10% of
the time.
- The integration test swamp. Every test boots a real database, calls real services, and
depends on shared test environments. Tests are slow because they set up and tear down heavy
infrastructure. They are flaky because they depend on network availability and shared mutable
state.
- The UI test obsession. The team writes tests exclusively through the UI layer. Business
logic that could be verified in milliseconds with a unit test is instead tested through a
full browser automation flow that takes seconds per assertion.
- The “we have coverage” illusion. Code coverage is high because the E2E tests exercise most
code paths. But the tests are so slow and brittle that developers do not run them locally. They
push code and wait 40 minutes to learn if it works. If a test fails, they assume it is flaky
and rerun.
The telltale sign: developers do not trust the test suite. They push code and go get coffee. When
tests fail, they rerun before investigating. When a test is red for days, nobody is alarmed.
Why This Is a Problem
An inverted test pyramid does not just slow the team down. It actively undermines every benefit
that testing is supposed to provide.
The suite is too slow to give useful feedback
The purpose of a test suite is to tell developers whether their change works - fast enough that
they can act on the feedback while they still have context. A suite that runs in seconds gives
feedback during development. A suite that runs in minutes gives feedback before the developer
moves on. A suite that runs in 30 or more minutes gives feedback after the developer has started
something else entirely.
When the suite takes 40 minutes, developers do not run it locally. They push to CI and
context-switch to a different task. When the result comes back, they have lost the mental model of the
code they changed. Investigating a failure takes longer because they have to re-read their own
code. Fixing the failure takes longer because they are now juggling two streams of work.
A well-structured suite - heavy on unit tests, light on E2E - runs in under 10 minutes. Developers
run it locally before pushing. Failures are caught while the code is still fresh. The feedback
loop is tight enough to support continuous integration.
Flaky tests destroy trust
End-to-end tests are inherently non-deterministic. They depend on network connectivity, shared
test environments, external service availability, browser rendering timing, and dozens of other
factors outside the developer’s control. A test that fails because a third-party API was slow for
200 milliseconds looks identical to a test that fails because the code is wrong.
When 10% of the suite fails randomly on any given run, developers learn to ignore failures. They
rerun the pipeline, and if it passes the second time, they assume the first failure was noise.
This behavior is rational given the incentives, but it is catastrophic for quality. Real failures
hide behind the noise. A test that detects a genuine regression gets rerun and ignored alongside
the flaky tests.
Unit tests and functional tests with test doubles are deterministic. They produce the same result
every time. When a deterministic test fails, the developer knows with certainty that they broke
something. There is no rerun. There is no “is that real?” The failure demands investigation.
Maintenance cost grows faster than value
End-to-end tests are expensive to write and expensive to maintain. A single E2E test typically
involves:
- Setting up test data across multiple services
- Navigating through UI flows with waits and retries
- Asserting on UI elements that change with every redesign
- Handling timeouts, race conditions, and flaky selectors
When a feature changes, every E2E test that touches that feature must be updated. A redesign of
the checkout page breaks 30 E2E tests even if the underlying behavior has not changed. The team
spends more time maintaining E2E tests than writing new features.
Unit tests are cheap to write and cheap to maintain. They test behavior, not UI layout. A function
that calculates a discount does not care whether the button is blue or green. When the discount
logic changes, one or two unit tests need updating - not thirty browser flows.
It couples your pipeline to external systems
When most of your tests are end-to-end or integration tests that hit real services, your ability
to deploy depends on every system in the chain being available and healthy. If the payment
provider’s sandbox is down, your pipeline fails. If the shared staging database is slow, your
tests time out. If another team deployed a breaking change to a shared service, your tests fail
even though your code is correct.
This is the opposite of what CD requires. Continuous delivery demands that your team can deploy
independently, at any time, regardless of the state of external systems. A test architecture
built on E2E tests makes your deployment hostage to every dependency in your ecosystem.
A suite built on unit tests, functional tests, and contract tests runs entirely within your
control. External dependencies are replaced with test doubles that are validated by contract
tests. Your pipeline can tell you “this change is safe to deploy” even if every external system
is offline.
Impact on continuous delivery
The inverted pyramid makes CD impossible in practice even if all the other pieces are in place.
The pipeline takes too long to support frequent integration. Flaky failures erode trust in the
automated quality gates. Developers bypass the tests or batch up changes to avoid the wait. The
team gravitates toward manual verification before deploying because they do not trust the
automated suite.
A team that deploys weekly with a 40-minute flaky suite cannot deploy daily without either fixing
the test architecture or abandoning automated quality gates. Neither option is acceptable.
Fixing the architecture is the only sustainable path.
How to Fix It
Inverting the pyramid does not mean deleting all your E2E tests and writing unit tests from
scratch. It means shifting the balance deliberately over time so that most confidence comes from
fast, deterministic tests and only a small amount comes from slow, non-deterministic ones.
Step 1: Audit your current test distribution (Week 1)
Count your tests by type and measure their characteristics:
| Test type | Count | Total duration | Flaky? | Requires external systems? |
| --- | --- | --- | --- | --- |
| Unit | ? | ? | ? | ? |
| Integration | ? | ? | ? | ? |
| Functional | ? | ? | ? | ? |
| E2E | ? | ? | ? | ? |
| Manual | ? | N/A | N/A | N/A |
Run the full suite three times. Note which tests fail intermittently. Record the total duration.
This is your baseline.
Step 2: Quarantine the flaky tests (Week 1)
Move every flaky test out of the pipeline-gating suite into a separate quarantine suite. This is
not deleting them - it is removing them from the critical path so that real failures are visible.
For each quarantined test, decide:
- Fix it if the behavior it tests is important and the flakiness has a solvable cause (timing
dependency, shared state, test order dependency).
- Replace it with a faster, deterministic test that covers the same behavior at a lower level.
- Delete it if the behavior is already covered by other tests or is not worth the maintenance
cost.
Target: zero flaky tests in the pipeline-gating suite by end of week.
Step 3: Push tests down the pyramid (Weeks 2-4)
For each E2E test in your suite, ask: “Can the behavior this test verifies be tested at a lower
level?”
Most of the time, the answer is yes. An E2E test that verifies “user can apply a discount code”
is actually testing three things:
- The discount calculation logic (testable with a unit test)
- The API endpoint that accepts the code (testable with a functional test)
- The UI flow for entering the code (testable with a component test)
Write the lower-level tests first. Once they exist and pass, the E2E test is redundant for gating
purposes. Move it to a post-deployment smoke suite or delete it.
Work through your E2E suite systematically, starting with the slowest and most flaky tests. Each
test you push down the pyramid makes the suite faster and more reliable.
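Taking the discount example, the logic piece pushed down to a unit test might look like the sketch
below. The `apply_discount` function and its codes are hypothetical, but the test now runs in
milliseconds with no browser, server, or environment involved:

```python
from decimal import Decimal

# Hypothetical discount logic pulled out of the checkout flow so it can be
# exercised directly, without driving the UI.
def apply_discount(total: Decimal, code: str) -> Decimal:
    codes = {"SAVE10": Decimal("0.10"), "SAVE25": Decimal("0.25")}
    if code not in codes:
        return total
    return (total * (1 - codes[code])).quantize(Decimal("0.01"))


# Unit tests: deterministic, instant, and unaffected by page redesigns.
def test_known_code_reduces_total():
    assert apply_discount(Decimal("100.00"), "SAVE10") == Decimal("90.00")

def test_unknown_code_leaves_total_unchanged():
    assert apply_discount(Decimal("100.00"), "BOGUS") == Decimal("100.00")
```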
Step 4: Replace external dependencies with test doubles (Weeks 2-4)
Identify every test that calls a real external service and replace the dependency:
| Dependency type | Test double approach |
| --- | --- |
| Database | In-memory database, testcontainers, or repository fakes |
| External HTTP API | HTTP stubs (WireMock, nock, MSW) |
| Message queue | In-memory fake or test spy |
| File storage | In-memory filesystem or temp directory |
| Third-party service | Stub that returns canned responses |
Validate your test doubles with contract tests that run asynchronously. This ensures your doubles
stay accurate without coupling your pipeline to external systems.
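A sketch of the pattern for a third-party service - the `CheckoutService`, the gateway interface,
and the response shape are illustrative, not any specific vendor's API. The seam is the point: the
service depends on whatever gateway it is given, not on a real network client:

```python
class StubPaymentGateway:
    """Test double that returns canned responses and records calls."""
    def __init__(self, approve: bool = True):
        self.approve = approve
        self.charges = []

    def charge(self, amount_cents: int, token: str) -> dict:
        self.charges.append((amount_cents, token))
        return {"status": "approved" if self.approve else "declined"}


class CheckoutService:
    def __init__(self, gateway):
        self.gateway = gateway

    def place_order(self, amount_cents: int, token: str) -> str:
        response = self.gateway.charge(amount_cents, token)
        return "confirmed" if response["status"] == "approved" else "rejected"


def test_order_is_confirmed_when_payment_approved():
    gateway = StubPaymentGateway(approve=True)
    assert CheckoutService(gateway).place_order(4999, "tok_test") == "confirmed"

def test_order_is_rejected_when_payment_declined():
    gateway = StubPaymentGateway(approve=False)
    assert CheckoutService(gateway).place_order(4999, "tok_test") == "rejected"
```

A scheduled contract test against the real sandbox then verifies that the stub's canned responses
still match reality, without putting the network inside the gating pipeline.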
Step 5: Adopt the test-for-every-change rule (Ongoing)
New code should be tested at the lowest possible level. Establish the team norm:
- Every new function with logic gets a unit test.
- Every new API endpoint or integration boundary gets a functional test.
- E2E tests are only added for critical smoke paths - not for every feature.
- Every bug fix includes a regression test at the lowest level that catches the bug.
Over time, this rule shifts the pyramid naturally. New code enters the codebase with the right
test distribution even as the team works through the legacy E2E suite.
Step 6: Address the objections
| Objection | Response |
| --- | --- |
| “Unit tests with mocks don’t test anything real” | They test logic, which is where most bugs live. A discount calculation that returns the wrong number is a real bug whether it is caught by a unit test or an E2E test. The unit test catches it in milliseconds. The E2E test catches it in minutes - if it is not flaky that day. |
| “E2E tests catch integration bugs that unit tests miss” | Functional tests with test doubles catch most integration bugs. Contract tests catch the rest. The small number of integration bugs that only E2E can find do not justify a suite of hundreds of slow, flaky E2E tests. |
| “We can’t delete E2E tests - they’re our safety net” | They are a safety net with holes. Flaky tests miss real failures. Slow tests delay feedback. Replace them with faster, deterministic tests that actually catch bugs reliably, then keep a small E2E smoke suite for post-deployment verification. |
| “Our code is too tightly coupled to unit test” | That is an architecture problem, not a testing problem. Start by writing tests for new code and refactoring existing code as you touch it. Use the Strangler Fig pattern - wrap untestable code in a testable layer. |
| “We don’t have time to rewrite the test suite” | You are already paying the cost of the inverted pyramid in slow feedback, flaky builds, and manual verification. The fix is incremental: push one test down the pyramid each day. After a month, the suite is measurably faster and more reliable. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Test suite duration | Should decrease toward under 10 minutes |
| Flaky test count in gating suite | Should reach and stay at zero |
| Test distribution (unit : integration : E2E ratio) | Unit tests should be the largest category |
| Pipeline pass rate | Should increase as flaky tests are removed |
| Developers running tests locally | Should increase as the suite gets faster |
| External dependencies in gating tests | Should reach zero |
4 - Pipeline and Infrastructure
Anti-patterns in build pipelines, deployment automation, and infrastructure management that block continuous delivery.
These anti-patterns affect the automated path from commit to production. They create manual steps,
slow feedback, and fragile deployments that prevent the reliable, repeatable delivery that
continuous delivery requires.
4.1 - No Pipeline Exists
Builds and deployments are manual processes. Someone runs a script on their laptop. There is no automated path from commit to production.
Category: Pipeline & Infrastructure | Quality Impact: Critical
What This Looks Like
Deploying to production requires a person. Someone opens a terminal, SSHs into a server, pulls the
latest code, runs a build command, and restarts a service. Or they download an artifact from a
shared drive, copy it to the right server, and run an install script. The steps live in a wiki page,
a shared document, or in someone’s head. Every deployment is a manual operation performed by
whoever knows the procedure.
There is no automation connecting a code commit to a running system. A developer finishes a feature,
pushes to the repository, and then a separate human process begins: someone must decide it is time
to deploy, gather the right artifacts, prepare the target environment, execute the deployment, and
verify that it worked. Each of these steps involves manual effort and human judgment.
The deployment procedure is a craft. Certain people are known for being “good at deploys.” New team
members are warned not to attempt deployments alone. When the person who knows the procedure is
unavailable, deployments wait. The team has learned to treat deployment as a risky, specialized
activity that requires care and experience.
Common variations:
- The deploy script on someone’s laptop. A shell script that automates some steps, but it lives
on one developer’s machine. Nobody else has it. When that developer is out, the team either waits
or reverse-engineers the procedure from the wiki.
- The manual checklist. A document with 30 steps: “SSH into server X, run this command, check
this log file, restart this service.” The checklist is usually out of date. Steps are missing or
in the wrong order. The person deploying adds corrections in the margins.
- The “only Dave can deploy” pattern. One person has the credentials, the knowledge, and the
muscle memory to deploy reliably. Deployments are scheduled around Dave’s availability. Dave
is a single point of failure and cannot take vacation during release weeks.
- The FTP deployment. Build artifacts are uploaded to a server via FTP, SCP, or a file share.
The person deploying must know which files go where, which config files to update, and which
services to restart. A missed file means a broken deployment.
- The manual build. There is no automated build at all. A developer runs the build command
locally, checks that it compiles, and copies the output to the deployment target. The build
that was tested is not necessarily the build that gets deployed.
The telltale sign: if deploying requires a specific person, a specific machine, or a specific
document that must be followed step by step, no pipeline exists.
Why This Is a Problem
The absence of a pipeline means every deployment is a unique event. No two deployments are
identical because human hands are involved in every step. This creates risk, waste, and
unpredictability that compound with every release.
It reduces quality
Without a pipeline, there is no enforced quality gate between a developer’s commit and production.
Tests may or may not be run before deploying. Static analysis may or may not be checked. The
artifact that reaches production may or may not be the same artifact that was tested. Every “may
or may not” is a gap where defects slip through.
Manual deployments also introduce their own defects. A step skipped in the checklist, a wrong
version of a config file, a service restarted in the wrong order - these are deployment bugs that
have nothing to do with the code. They are caused by the deployment process itself. The more manual
steps involved, the more opportunities for human error.
A pipeline eliminates both categories of risk. Every commit passes through the same automated
checks. The artifact that is tested is the artifact that is deployed. There are no skipped steps
because the steps are encoded in the pipeline definition and execute the same way every time.
It increases rework
Manual deployments are slow, so teams batch changes to reduce deployment frequency. Batching means
more changes per deployment. More changes means harder debugging when something goes wrong, because
any of dozens of commits could be the cause. The team spends hours bisecting changes to find the
one that broke production.
Failed manual deployments create their own rework. A deployment that goes wrong must be diagnosed,
rolled back (if rollback is even possible), and re-attempted. Each re-attempt burns time and
attention. If the deployment corrupted data or left the system in a partial state, the recovery
effort dwarfs the original deployment.
Rework also accumulates in the deployment procedure itself. Every deployment surfaces a new edge
case or a new prerequisite that was not in the checklist. Someone updates the wiki. The next
deployer reads the old version. The procedure is never quite right because manual procedures
cannot be versioned, tested, or reviewed the way code can.
With an automated pipeline, deployments are fast and repeatable. Small changes deploy individually.
Failed deployments are rolled back automatically. The pipeline definition is code - versioned,
reviewed, and tested like any other part of the system.
It makes delivery timelines unpredictable
A manual deployment takes an unpredictable amount of time. The optimistic case is 30 minutes. The
realistic case includes troubleshooting unexpected errors, waiting for the right person to be
available, and re-running steps that failed. A “quick deploy” can easily consume half a day.
The team cannot commit to release dates because the deployment itself is a variable. “We can deploy
on Tuesday” becomes “we can start the deployment on Tuesday, and we’ll know by Wednesday whether it
worked.” Stakeholders learn that deployment dates are approximate, not firm.
The unpredictability also limits deployment frequency. If each deployment takes hours of manual
effort and carries risk of failure, the team deploys as infrequently as possible. This increases
batch size, which increases risk, which makes deployments even more painful, which further
discourages frequent deployment. The team is trapped in a cycle where the lack of a pipeline makes
deployments costly, and costly deployments make the lack of a pipeline seem acceptable.
An automated pipeline makes deployment duration fixed and predictable. A deploy takes the same
amount of time whether it happens once a month or ten times a day. The cost per deployment drops
to near zero, removing the incentive to batch.
It concentrates knowledge in too few people
When deployment is manual, the knowledge of how to deploy lives in people rather than in code. The
team depends on specific individuals who know the servers, the credentials, the order of
operations, and the workarounds for known issues. These individuals become bottlenecks and single
points of failure.
When the deployment expert is unavailable - sick, on vacation, or has left the company - the team
is stuck. Someone else must reconstruct the deployment procedure from incomplete documentation and
trial and error. Deployments attempted by inexperienced team members fail at higher rates, which
reinforces the belief that only experts should deploy.
A pipeline encodes deployment knowledge in an executable definition that anyone can run. New team
members deploy on their first day by triggering the pipeline. The deployment expert’s knowledge is
preserved in code rather than in their head. The bus factor for deployments moves from one to the
entire team.
Impact on continuous delivery
Continuous delivery requires an automated, repeatable pipeline that can take any commit from trunk
and deliver it to production with confidence. Without a pipeline, none of this is possible. There
is no automation to repeat. There is no confidence that the process will work the same way twice.
There is no path from commit to production that does not require a human to drive it.
The pipeline is not an optimization of manual deployment. It is a prerequisite for CD. A team
without a pipeline cannot practice CD any more than a team without source control can practice
version management. The pipeline is the foundation. Everything else - automated testing, deployment
strategies, progressive rollouts, fast rollback - depends on it existing.
How to Fix It
Step 1: Document the current manual process exactly (Week 1)
Before automating, capture what the team actually does today. Have the person who deploys most
often write down every step in order:
- What commands do they run?
- What servers do they connect to?
- What credentials do they use?
- What checks do they perform before, during, and after?
- What do they do when something goes wrong?
This document is not the solution - it is the specification for the first version of the pipeline.
Every manual step will become an automated step.
Step 2: Automate the build (Week 2)
Start with the simplest piece: turning source code into a deployable artifact without manual
intervention.
- Choose a CI server (Jenkins, GitHub Actions, GitLab CI, CircleCI, or any tool that triggers on
commit).
- Configure it to check out the code and run the build command on every push to trunk.
- Store the build output as a versioned artifact.
At this point, the team has an automated build but still deploys manually. That is fine. The
pipeline will grow incrementally.
Step 3: Add automated tests to the build (Week 3)
If the team has any automated tests, add them to the pipeline so they run after the build
succeeds. If the team has no automated tests, add one. A single test that verifies the application
starts up is more valuable than zero tests.
The pipeline should now fail if the build fails or if any test fails. This is the first automated
quality gate. No artifact is produced unless the code compiles and the tests pass.
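That first test can be almost trivially small. A sketch, assuming the application exposes a
`create_app()` factory - adjust the import to whatever entry point your codebase actually has:

```python
# A minimal first test: the application can be constructed and wired together.
from myapp import create_app  # hypothetical module and factory

def test_application_starts():
    app = create_app()
    assert app is not None
```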
Step 4: Automate the deployment to a non-production environment (Weeks 3-4)
Take the manual deployment steps from Step 1 and encode them in a script or pipeline stage that
deploys the tested artifact to a staging or test environment:
- Provision or configure the target environment.
- Deploy the artifact.
- Run a smoke test to verify the deployment succeeded.
The team now has a pipeline that builds, tests, and deploys to a non-production environment on
every commit. Deployments to this environment should happen without any human intervention.
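The smoke test can be as simple as polling the deployed environment's health endpoint and failing
the stage if it never answers. A sketch using only the Python standard library - the URL and
timings are placeholders:

```python
"""Post-deployment smoke test: fail the pipeline stage if the freshly deployed
environment does not answer its health endpoint within the deadline."""
import sys
import time
import urllib.request

def wait_for_health(url: str, timeout_seconds: int = 120) -> bool:
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except OSError:
            pass  # not up yet; keep polling until the deadline
        time.sleep(5)
    return False

if __name__ == "__main__":
    url = sys.argv[1] if len(sys.argv) > 1 else "https://staging.example.com/health"
    if not wait_for_health(url):
        print(f"smoke test failed: {url} never became healthy")
        sys.exit(1)
    print(f"smoke test passed: {url} is healthy")
```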
Step 5: Extend the pipeline to production (Weeks 5-6)
Once the team trusts the automated deployment to non-production environments, extend it to
production:
- Add a manual approval gate if the team is not yet comfortable with fully automated production
deployments. This is a temporary step - the goal is to remove it later.
- Use the same deployment script and process for production that you use for non-production. The
only difference should be the target environment and its configuration.
- Add post-deployment verification: health checks, smoke tests, or basic monitoring checks that
confirm the deployment is healthy.
The first automated production deployment will be nerve-wracking. That is normal. Run it alongside
the manual process the first few times: deploy automatically, then verify manually. As confidence
grows, drop the manual verification.
Step 6: Address the objections (Ongoing)
| Objection | Response |
| --- | --- |
| “Our deployments are too complex to automate” | If a human can follow the steps, a script can execute them. Complex deployments benefit the most from automation because they have the most opportunities for human error. |
| “We don’t have time to build a pipeline” | You are already spending time on every manual deployment. A pipeline is an investment that pays back on the second deployment and every deployment after. |
| “Only Dave knows how to deploy” | That is the problem, not a reason to keep the status quo. Building the pipeline captures Dave’s knowledge in code. Dave should lead the pipeline effort because he knows the procedure best. |
| “What if the pipeline deploys something broken?” | The pipeline includes automated tests and can include approval gates. A broken deployment from a pipeline is no worse than a broken deployment from a human - and the pipeline can roll back automatically. |
| “Our infrastructure doesn’t support modern CI/CD tools” | Start with a shell script triggered by a cron job or a webhook. A pipeline does not require Kubernetes or cloud-native infrastructure. It requires automation of the steps you already perform manually. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Manual steps in the deployment process | Should decrease to zero |
| Deployment duration | Should decrease and stabilize as manual steps are automated |
| Release frequency | Should increase as deployment cost drops |
| Deployment failure rate | Should decrease as human error is removed |
| People who can deploy to production | Should increase from one or two to the entire team |
| Lead time | Should decrease as the manual deployment bottleneck is eliminated |
4.2 - Manual Deployments
The build is automated but deployment is not. Someone must SSH into servers, run scripts, and shepherd each release to production by hand.
Category: Pipeline & Infrastructure | Quality Impact: High
What This Looks Like
The team has a CI server. Code is built and tested automatically on every push. The pipeline
dashboard is green. But between “pipeline passed” and “code running in production,” there is a
person. Someone must log into a deployment tool, click a button, select the right artifact, choose
the right environment, and watch the output scroll by. Or they SSH into servers, pull the artifact,
run migration scripts, restart services, and verify health checks - all by hand.
The team may not even think of this as a problem. The build is automated. The tests run
automatically. Deployment is “just the last step.” But that last step takes 30 minutes to an hour
of focused human attention, can only happen when the right person is available, and fails often
enough that nobody wants to do it on a Friday afternoon.
Deployment has its own rituals. The team announces in Slack that a deploy is starting. Other
developers stop merging. Someone watches the logs. Another person checks the monitoring dashboard.
When it is done, someone posts a confirmation. The whole team holds its breath during the process
and exhales when it works. This ceremony happens every time, whether the release is one commit or
fifty.
Common variations:
- The button-click deploy. The CI/CD tool has a “deploy to production” button, but a human must
click it and then monitor the result. The automation exists but is not trusted to run
unattended. Someone watches every deployment from start to finish.
- The runbook deploy. A document describes the deployment steps in order. The deployer follows
the runbook, executing commands manually at each step. The runbook was written months ago and
has handwritten corrections in the margins. Some steps have been added, others crossed out.
- The SSH-and-pray deploy. The deployer SSHs into each server individually, pulls code or
copies artifacts, runs scripts, and restarts services. The order matters. Missing a server means
a partial deployment. The deployer keeps a mental checklist of which servers are done.
- The release coordinator deploy. One person coordinates the deployment across multiple systems.
They send messages to different teams: “deploy service A now,” “run the database migration,”
“restart the cache.” The deployment is a choreographed multi-person event.
- The after-hours deploy. Deployments happen only outside business hours because the manual
process is risky enough that the team wants minimal user traffic. Deployers work evenings or
weekends. The deployment window is sacred and stressful.
The telltale sign: if the pipeline is green but the team still needs to “do a deploy” as a
separate activity, deployment is manual.
Why This Is a Problem
A manual deployment negates much of the value that an automated build and test pipeline provides.
The pipeline can validate code in minutes, but if the last mile to production requires a human,
the delivery speed is limited by that human’s availability, attention, and reliability.
It reduces quality
Manual deployment introduces a category of defects that have nothing to do with the code. A
deployer who runs migration scripts in the wrong order corrupts data. A deployer who forgets to
update a config file on one of four servers creates inconsistent behavior. A deployer who restarts
services too quickly triggers a cascade of connection errors. These are process defects - bugs
introduced by the deployment method, not the software.
Manual deployments also degrade the quality signal from the pipeline. The pipeline tests a specific
artifact in a specific configuration. If the deployer manually adjusts configuration, selects a
different artifact version, or skips a verification step, the deployed system no longer matches
what the pipeline validated. The pipeline said “this is safe to deploy,” but what actually reached
production is something slightly different.
Automated deployment eliminates process defects by executing the same steps in the same order
every time. The artifact the pipeline tested is the artifact that reaches production. Configuration
is applied from version-controlled definitions, not from human memory. The deployment is identical
whether it happens at 2 PM on Tuesday or 3 AM on Saturday.
It increases rework
Because manual deployments are slow and risky, teams batch changes. Instead of deploying each
commit individually, they accumulate a week or two of changes and deploy them together. When
something breaks in production, the team must determine which of thirty commits caused the problem.
This diagnosis takes hours. The fix takes more hours. If the fix itself requires a deployment, the
team must go through the manual process again.
Failed deployments are especially costly. A manual deployment that leaves the system in a broken
state requires manual recovery. The deployer must diagnose what went wrong, decide whether to roll
forward or roll back, and execute the recovery steps by hand. If the deployment was a multi-server
process and some servers are on the new version while others are on the old version, the recovery
is even harder. The team may spend more time recovering from a failed deployment than they spent
on the deployment itself.
With automated deployments, each commit deploys individually. When something breaks, the cause is
obvious - it is the one commit that just deployed. Rollback is a single action, not a manual
recovery effort. The time from “something is wrong” to “the previous version is running” is
minutes, not hours.
It makes delivery timelines unpredictable
The gap between “pipeline is green” and “code is in production” is measured in human availability.
If the deployer is in a meeting, the deployment waits. If the deployer is on vacation, the
deployment waits longer. If the deployment fails and the deployer needs help, the recovery depends
on who else is around.
This human dependency makes release timing unpredictable. The team cannot promise “this fix will be
in production in 30 minutes” because the deployment requires a person who may not be available for
hours. Urgent fixes wait for deployment windows. Critical patches wait for the release coordinator
to finish lunch.
The batching effect adds another layer of unpredictability. When teams batch changes to reduce
deployment frequency, each deployment becomes larger and riskier. Larger deployments take longer to
verify and are more likely to fail. The team cannot predict how long the deployment will take
because they cannot predict what will go wrong with a batch of thirty changes.
Automated deployment makes the time from “pipeline green” to “running in production” fixed and
predictable. It takes the same number of minutes regardless of who is available, what day it is,
or how many other things are happening. The team can promise delivery timelines because the
deployment is a deterministic process, not a human activity.
It prevents fast recovery
When production breaks, speed of recovery determines the blast radius. A team that can deploy a
fix in five minutes limits the damage. A team that needs 45 minutes of manual deployment work
exposes users to the problem for 45 minutes plus diagnosis time.
Manual rollback is even worse. Many teams with manual deployments have no practiced rollback
procedure at all. “Rollback” means “re-deploy the previous version,” which means running the
entire manual deployment process again with a different artifact. If the deployment process takes
an hour, rollback takes an hour. If the deployment process requires a specific person, rollback
requires that same person.
Some manual deployments cannot be cleanly rolled back. Database migrations that ran during the
deployment may not have reverse scripts. Config changes applied to servers may not have been
tracked. The team is left doing a forward fix under pressure, manually deploying a patch through
the same slow process that caused the problem.
Automated pipelines with automated rollback can revert to the previous version in minutes. The
rollback follows the same tested path as the deployment. No human judgment is required. The team’s
mean time to repair drops from hours to minutes.
Impact on continuous delivery
Continuous delivery means any commit that passes the pipeline can be released to production at any
time with confidence. Manual deployment breaks this definition at “at any time.” The commit can
only be released when a human is available to perform the deployment, when the deployment window
is open, and when the team is ready to dedicate attention to watching the process.
The manual deployment step is the bottleneck that limits everything upstream. The pipeline can
validate commits in 10 minutes, but if deployment takes an hour of human effort, the team will
never deploy more than a few times per day at best. In practice, teams with manual deployments
release weekly or biweekly because the deployment overhead makes anything more frequent
impractical.
The pipeline is only half the delivery system. Automating the build and tests without automating
the deployment is like paving a highway that ends in a dirt road. The speed of the paved section
is irrelevant if every journey ends with a slow, bumpy last mile.
How to Fix It
Step 1: Script the current manual process (Week 1)
Take the runbook, the checklist, or the knowledge in the deployer’s head and turn it into a
script. Do not redesign the process yet - just encode what the team already does.
- Record a deployment from start to finish. Note every command, every server, every check.
- Write a script that executes those steps in order.
- Store the script in version control alongside the application code.
The script will be rough. It will have hardcoded values and assumptions. That is fine. The goal
is to make the deployment reproducible by any team member, not to make it perfect.
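A first version might look like the sketch below. Every host name, path, and command here is a
placeholder standing in for whatever the runbook actually says - the value is that the steps are
now code, versioned next to the application, and run the same way for every deployer:

```python
"""First pass at encoding the runbook. Host names, paths, and the service name
are placeholders copied from a hypothetical checklist."""
import subprocess
import sys

HOSTS = ["app1.internal", "app2.internal"]   # from the runbook
ARTIFACT = sys.argv[1]                       # e.g. build/myapp-1.4.2.tar.gz
SERVICE = "myapp"

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)          # stop on the first failure

for host in HOSTS:
    # Runbook step 1: copy the artifact to the host.
    run(["scp", ARTIFACT, f"deploy@{host}:/opt/{SERVICE}/releases/"])
    # Runbook step 2: unpack it and restart the service, in the documented order.
    run(["ssh", f"deploy@{host}",
         f"tar -xzf /opt/{SERVICE}/releases/{ARTIFACT.split('/')[-1]} -C /opt/{SERVICE}/current "
         f"&& sudo systemctl restart {SERVICE}"])
    # Runbook step 3: the post-restart check the deployer used to do by eye.
    run(["ssh", f"deploy@{host}", f"systemctl is-active {SERVICE}"])

print("deployment complete on all hosts")
```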
Step 2: Run the script from the pipeline (Week 2)
Connect the deployment script to the CI/CD pipeline so it runs automatically after the build and
tests pass. Start with a non-production environment:
- Add a deployment stage to the pipeline that targets a staging or test environment.
- Trigger it automatically on every successful build.
- Add a smoke test after deployment to verify it worked.
The team now gets automatic deployments to a non-production environment on every commit. This
builds confidence in the automation and surfaces problems early.
Step 3: Externalize configuration and secrets (Weeks 2-3)
Manual deployments often involve editing config files on servers or passing environment-specific
values by hand. Move these out of the manual process:
- Store environment-specific configuration in a config management system or environment variables
managed by the pipeline.
- Move secrets to a secrets manager (Vault, AWS Secrets Manager, Azure Key Vault, or even
encrypted pipeline variables as a starting point).
- Ensure the deployment script reads configuration from these sources rather than from hardcoded
values or manual input.
This step is critical because manual configuration is one of the most common sources of deployment
failures. Automating deployment without automating configuration just moves the manual step.
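On the application side, this usually reduces to reading configuration from the environment and
failing fast when something is missing. A minimal sketch with example variable names:

```python
"""Configuration read from the environment instead of hand-edited files on the
server. The variable names are examples; the pipeline or a secrets manager
supplies the values per environment."""
import os

REQUIRED = ["DATABASE_URL", "API_BASE_URL", "LOG_LEVEL"]

def load_config() -> dict:
    missing = [name for name in REQUIRED if name not in os.environ]
    if missing:
        # Fail fast at startup rather than failing mysteriously at runtime.
        raise RuntimeError(f"missing required configuration: {', '.join(missing)}")
    return {name: os.environ[name] for name in REQUIRED}
```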
Step 4: Automate production deployment with a gate (Weeks 3-4)
Extend the pipeline to deploy to production using the same script and process:
- Add a production deployment stage after the non-production deployment succeeds.
- Include a manual approval gate - a button that a team member clicks to authorize the production
deployment. This is a temporary safety net while the team builds confidence.
- Add post-deployment health checks that automatically verify the deployment succeeded.
- Add automated rollback that triggers if the health checks fail.
The approval gate means a human still decides when to deploy, but the deployment itself is fully
automated. No SSHing. No manual steps. No watching logs scroll by.
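The production stage then becomes deploy, verify, and roll back on failure. A sketch, where the
deploy and rollback commands are placeholders for your existing automation and the health URL is
an example:

```python
"""Sketch of the production stage: deploy, verify, and roll back automatically
if the post-deployment health checks fail."""
import subprocess
import sys
import time
import urllib.request

def healthy(url: str, checks: int = 5, interval: int = 10) -> bool:
    for _ in range(checks):
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status != 200:
                    return False
        except OSError:
            return False
        time.sleep(interval)
    return True  # stayed healthy across every check

def main() -> int:
    version = sys.argv[1]
    subprocess.run(["./deploy.sh", "production", version], check=True)    # placeholder command
    if healthy("https://www.example.com/health"):
        print(f"deployed {version} and verified healthy")
        return 0
    print(f"health checks failed - rolling back {version}")
    subprocess.run(["./deploy.sh", "production", "previous"], check=True)  # placeholder rollback
    return 1

if __name__ == "__main__":
    sys.exit(main())
```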
Step 5: Remove the manual gate (Weeks 6-8)
Once the team has seen the automated production deployment succeed repeatedly, remove the manual
approval gate. The pipeline now deploys to production automatically when all checks pass.
This is the hardest step emotionally. The team will resist. Expect these objections:
| Objection | Response |
| --- | --- |
| “We need a human to decide when to deploy” | Why? If the pipeline validates the code and the deployment process is automated and tested, what decision is the human making? If the answer is “checking that nothing looks weird,” that check should be automated. |
| “What if it deploys during peak traffic?” | Use deployment windows in the pipeline configuration, or use progressive rollout strategies that limit blast radius regardless of traffic. |
| “We had a bad deployment last month” | Was it caused by the automation or by a gap in testing? If the tests missed a defect, the fix is better tests, not a manual gate. If the deployment process itself failed, the fix is better deployment automation, not a human watching. |
| “Compliance requires manual approval” | Review the actual compliance requirement. Most require evidence of approval, not a human clicking a button at deployment time. A code review approval, an automated policy check, or an audit log of the pipeline run often satisfies the requirement. |
| “Our deployments require coordination with other teams” | Automate the coordination. Use API contracts, deployment dependencies in the pipeline, or event-based triggers. If another team must deploy first, encode that dependency rather than coordinating in Slack. |
Step 6: Add deployment observability (Ongoing)
Once deployments are automated, invest in knowing whether they worked:
- Monitor error rates, latency, and key business metrics after every deployment.
- Set up automatic rollback triggers tied to these metrics.
- Track deployment frequency, duration, and failure rate over time.
The team should be able to deploy without watching. The monitoring watches for them.
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Manual steps per deployment | Should reach zero |
| Deployment duration (human time) | Should drop from hours to zero - the pipeline does the work |
| Release frequency | Should increase as deployment friction drops |
| Change fail rate | Should decrease as manual process defects are eliminated |
| Mean time to repair | Should decrease as rollback becomes automated |
| Lead time | Should decrease as the deployment bottleneck is removed |
4.3 - Snowflake Environments
Each environment is hand-configured and unique. Nobody knows exactly what is running where. Configuration drift is constant.
Category: Pipeline & Infrastructure | Quality Impact: High
What This Looks Like
Staging has a different version of the database than production. The dev environment has a library
installed that nobody remembers adding. Production has a configuration file that was edited by hand
six months ago during an incident and never committed to source control. Nobody is sure all three
environments are running the same OS patch level.
A developer asks “why does this work in staging but not in production?” The answer takes hours to
find because it requires comparing configurations across environments by hand - diffing config
files, checking installed packages, verifying environment variables one by one.
Common variations:
- The hand-built server. Someone provisioned the production server two years ago. They followed
a wiki page that has since been edited, moved, or deleted. Nobody has provisioned a new one
since. If the server dies, nobody is confident they can recreate it.
- The magic SSH session. During an incident, someone SSH-ed into production and changed a
config value. It fixed the problem. Nobody updated the deployment scripts, the infrastructure
code, or the documentation. The next deployment overwrites the fix - or doesn’t, depending on
which files the deployment touches.
- The shared dev environment. A single development or staging environment is shared by the
whole team. One developer installs a library, another changes a config value, a third adds a
cron job. The environment drifts from any known baseline within weeks.
- The “production is special” mindset. Dev and staging environments are provisioned with
scripts, but production was set up differently because of “security requirements” or “scale
differences.” The result is that the environments the team tests against are structurally
different from the one that serves users.
- The environment with a name. Environments have names like “staging-v2” or “qa-new” because
someone created a new one alongside the old one. Both still exist. Nobody is sure which one the
pipeline deploys to.
The telltale sign: deploying the same artifact to two environments produces different results,
and the team’s first instinct is to check environment configuration rather than application code.
Why This Is a Problem
Snowflake environments undermine the fundamental premise of testing: that the behavior you observe
in one environment predicts the behavior you will see in another. When every environment is
unique, testing in staging tells you what works in staging - nothing more.
It reduces quality
When environments differ, bugs hide in the gaps. An application that works in staging may fail in
production because of a different library version, a missing environment variable, or a filesystem
permission that was set by hand. These bugs are invisible to testing because the test environment
does not reproduce the conditions that trigger them.
The team learns this the hard way, one production incident at a time. Each incident teaches the
team that “passed in staging” does not mean “will work in production.” This erodes trust in the
entire testing and deployment process. Developers start adding manual verification steps -
checking production configs by hand before deploying, running smoke tests manually after
deployment, asking the ops team to “keep an eye on things.”
When environments are identical and provisioned from the same code, the gap between staging and
production disappears. What works in staging works in production because the environments are the
same. Testing produces reliable results.
It increases rework
Snowflake environments cause two categories of rework. First, developers spend hours debugging
environment-specific issues that have nothing to do with application code. “Why does this work on
my machine but not in CI?” leads to comparing configurations, googling error messages related to
version mismatches, and patching environments by hand. This time is pure waste.
Second, production incidents caused by environment drift require investigation, rollback, and
fixes to both the application and the environment. A configuration difference that causes a
production failure might take five minutes to fix once identified, but identifying it takes hours
because nobody knows what the correct configuration should be.
Teams with reproducible environments spend zero time on environment debugging. If an environment
is wrong, they destroy it and recreate it from code. The investigation time drops from hours to
minutes.
It makes delivery timelines unpredictable
Deploying to a snowflake environment is unpredictable because the environment itself is an
unknown variable. The same deployment might succeed on Monday and fail on Friday because someone
changed something in the environment between the two deploys. The team cannot predict how long a
deployment will take because they cannot predict what environment issues they will encounter.
This unpredictability compounds across environments. A change must pass through dev, staging, and
production, and each environment is a unique snowflake with its own potential for surprise. A
deployment that should take minutes takes hours because each environment reveals a new
configuration issue.
Reproducible environments make deployment time a constant. The same artifact deployed to the same
environment specification produces the same result every time. Deployment becomes a predictable
step in the pipeline rather than an adventure.
It makes environments a scarce resource
When environments are hand-configured, creating a new one is expensive. It takes hours or days of
manual work. The team has a small number of shared environments and must coordinate access. “Can
I use staging today?” becomes a daily question. Teams queue up for access to the one environment
that resembles production.
This scarcity blocks parallel work. Two developers who both need to test a database migration
cannot do so simultaneously if there is only one staging environment. One waits while the other
finishes. Features that could be validated in parallel are serialized through a shared
environment bottleneck.
When environments are defined as code, spinning up a new one is a pipeline step that takes
minutes. Each developer or feature branch can have its own environment. There is no contention
because environments are disposable and cheap.
Impact on continuous delivery
Continuous delivery requires that any change can move from commit to production through a fully
automated pipeline. Snowflake environments break this in multiple ways. The pipeline cannot
provision environments automatically if environments are hand-configured. Testing results are
unreliable because environments differ. Deployments fail unpredictably because of configuration
drift.
A team with snowflake environments cannot trust their pipeline. They cannot deploy frequently
because each deployment risks hitting an environment-specific issue. They cannot automate
fully because the environments require manual intervention. The path from commit to production
is neither continuous nor reliable.
How to Fix It
Step 1: Document what exists today (Week 1)
Before automating anything, capture the current state of each environment:
- For each environment (dev, staging, production), record: OS version, installed packages,
configuration files, environment variables, external service connections, and any manual
customizations.
- Diff the environments against each other. Note every difference.
- Classify each difference as intentional (e.g., production uses a larger instance size) or
accidental (e.g., staging has an old library version nobody updated).
This audit surfaces the drift. Most teams are surprised by how many accidental differences exist.
Step 2: Define one environment specification (Weeks 2-3)
Choose an infrastructure-as-code tool (Terraform, Pulumi, CloudFormation, Ansible, or similar)
and write a specification for one environment. Start with the environment you understand best -
usually staging.
The specification should define:
- Base infrastructure (servers, containers, networking)
- Installed packages and their versions
- Configuration files and their contents
- Environment variables with placeholder values
- Any scripts that run at provisioning time
Verify the specification by destroying the staging environment and recreating it from code. If
the recreated environment works, the specification is correct. If it does not, fix the
specification until it does.
Step 3: Parameterize for environment differences (Week 3)
Intentional differences between environments (instance sizes, database connection strings, API
keys) become parameters, not separate specifications. One specification with environment-specific
variables:
| Parameter | Dev | Staging | Production |
| --- | --- | --- | --- |
| Instance size | small | medium | large |
| Database host | dev-db.internal | staging-db.internal | prod-db.internal |
| Log level | debug | info | warn |
| Replica count | 1 | 2 | 3 |
The structure is identical. Only the values change. This eliminates accidental drift because every
environment is built from the same template.
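In most infrastructure-as-code tools this takes the form of one module plus a small variables file
per environment. The idea can be sketched in plain Python, using the values from the table above
as illustrations:

```python
# One template, three sets of values. Whatever IaC tool you use, the shape is
# the same: the structure is shared, only the parameters differ per environment.
BASE = {
    "service": "myapp",              # illustrative service name
    "health_check_path": "/health",
    "port": 8080,
}

PARAMETERS = {
    "dev":        {"instance_size": "small",  "db_host": "dev-db.internal",     "log_level": "debug", "replicas": 1},
    "staging":    {"instance_size": "medium", "db_host": "staging-db.internal", "log_level": "info",  "replicas": 2},
    "production": {"instance_size": "large",  "db_host": "prod-db.internal",    "log_level": "warn",  "replicas": 3},
}

def environment_spec(name: str) -> dict:
    # Accidental drift is impossible: every environment is the base template
    # plus its declared parameters, nothing more.
    return {**BASE, **PARAMETERS[name]}
```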
Step 4: Provision environments through the pipeline (Week 4)
Add environment provisioning to the deployment pipeline:
- Before deploying to an environment, the pipeline provisions (or updates) it from the
infrastructure code.
- The application artifact is deployed to the freshly provisioned environment.
- If provisioning or deployment fails, the pipeline fails - no manual intervention.
This closes the loop. Environments cannot drift because they are recreated or reconciled on
every deployment. Manual SSH sessions and hand edits have no lasting effect because the next
pipeline run overwrites them.
Step 5: Make environments disposable (Week 5+)
The ultimate goal is that any environment can be destroyed and recreated in minutes with no data
loss and no human intervention:
- Practice destroying and recreating staging weekly. This verifies the specification stays
accurate and builds team confidence.
- Provision ephemeral environments for feature branches or pull requests. Let the pipeline
create and destroy them automatically.
- If recreating production is not feasible yet (stateful systems, licensing), ensure you can
provision a production-identical environment for testing at any time.
| Objection | Response |
| --- | --- |
| “Production has unique requirements we can’t codify” | If a requirement exists only in production and is not captured in code, it is at risk of being lost. Codify it. If it is truly unique, it belongs in a parameter, not a hand-edit. |
| “We don’t have time to learn infrastructure-as-code” | You are already spending that time debugging environment drift. The investment pays for itself within weeks. Start with the simplest tool that works for your platform. |
| “Our environments are managed by another team” | Work with them. Provide the specification. If they provision from your code, you both benefit: they have a reproducible process and you have predictable environments. |
| “Containers solve this problem” | Containers solve application-level consistency. You still need infrastructure-as-code for the platform the containers run on - networking, storage, secrets, load balancers. Containers are part of the solution, not the whole solution. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Environment provisioning time | Should decrease from hours/days to minutes |
| Configuration differences between environments | Should reach zero accidental differences |
| “Works in staging but not production” incidents | Should drop to near zero |
| Change fail rate | Should decrease as environment parity improves |
| Mean time to repair | Should decrease as environments become reproducible |
| Time spent debugging environment issues | Track informally - should approach zero |
5 - Organizational and Cultural
Anti-patterns in team culture, management practices, and organizational structure that block continuous delivery.
These anti-patterns affect the human and organizational side of delivery. They create
misaligned incentives, erode trust, and block the cultural changes that continuous delivery
requires. Technical practices alone cannot overcome a culture that works against them.
5.1 - Change Advisory Board Gates
Manual committee approval required for every production change. Meetings are weekly. One-line fixes wait alongside major migrations.
Category: Organizational & Cultural | Quality Impact: High
What This Looks Like
Before any change can reach production, it must be submitted to the Change Advisory Board. The
developer fills out a change request form: description of the change, impact assessment, rollback
plan, testing evidence, and approval signatures. The form goes into a queue. The CAB meets once
a week - sometimes every two weeks - to review the queue. Each change gets a few minutes of
discussion. The board approves, rejects, or requests more information.
A one-line configuration fix that a developer finished on Monday waits until Thursday’s CAB
meeting. If the board asks a question, the change waits until the next meeting. A two-line bug
fix sits in the same queue as a database migration, reviewed by the same people with the same
ceremony.
Common variations:
- The rubber-stamp CAB. The board approves everything. Nobody reads the change requests
carefully because the volume is too high and the context is too shallow. The meeting exists
to satisfy an audit requirement, not to catch problems. It adds delay without adding safety.
- The bottleneck approver. One person on the CAB must approve every change. That person is
in six other meetings, has 40 pending reviews, and is on vacation next week. Deployments
stop when they are unavailable.
- The emergency change process. Urgent fixes bypass the CAB through an “emergency change”
procedure that requires director-level approval and a post-hoc review. The emergency process
is faster, so teams learn to label everything urgent. The CAB process is for scheduled changes,
and fewer changes are scheduled.
- The change freeze. Certain periods - end of quarter, major events, holidays - are declared
change-free zones. No production changes for days or weeks. Changes pile up during the freeze
and deploy in a large batch afterward, which is exactly the high-risk event the freeze was
meant to prevent.
- The form-driven process. The change request template has 15 fields, most of which are
irrelevant for small changes. Developers spend more time filling out the form than making the
change. Some fields require information the developer does not have, so they make something up.
The telltale sign: a developer finishes a change and says “now I need to submit it to the CAB”
with the same tone they would use for “now I need to go to the dentist.”
Why This Is a Problem
CAB gates exist to reduce risk. In practice, they increase risk by creating delay, encouraging
batching, and providing a false sense of security. The review is too shallow to catch real
problems and too slow to enable fast delivery.
It reduces quality
A CAB review is a review by people who did not write the code, did not test it, and often do not
understand the system it affects. A board member scanning a change request form for five minutes
cannot assess the quality of a code change. They can check that the form is filled out. They
cannot check that the change is safe.
The real quality checks - automated tests, code review by peers, deployment verification - happen
before the CAB sees the change. The CAB adds nothing to quality because it reviews paperwork, not
code. The developer who wrote the tests and the reviewer who read the diff know far more about
the change’s risk than a board member reading a summary.
Meanwhile, the delay the CAB introduces actively harms quality. A bug fix that is ready on Monday
but cannot deploy until Thursday means users experience the bug for three extra days. A security
patch that waits for weekly approval is a vulnerability window measured in days.
Teams without CAB gates deploy quality checks into the pipeline itself: automated tests, security
scans, peer review, and deployment verification. These checks are faster, more thorough, and
more reliable than a weekly committee meeting.
It increases rework
The CAB process generates significant administrative overhead. For every change, a developer must
write a change request, gather approval signatures, and attend (or wait for) the board meeting.
This overhead is the same whether the change is a one-line typo fix or a major feature.
When the CAB requests more information or rejects a change, the cycle restarts. The developer
updates the form, resubmits, and waits for the next meeting. A change that was ready to deploy
a week ago sits in a review loop while the developer has moved on to other work. Picking it back
up costs context-switching time.
The batching effect creates its own rework. When changes are delayed by the CAB process, they
accumulate. Developers merge multiple changes to avoid submitting multiple requests. Larger
batches are harder to review, harder to test, and more likely to cause problems. When a problem
occurs, it is harder to identify which change in the batch caused it.
It makes delivery timelines unpredictable
The CAB introduces a built-in delay into every deployment. If the board meets weekly, a change waits anywhere from a day to a full week between “ready” and “deployed”, depending on when it was finished relative to the meeting schedule. This delay is independent of the change’s size, risk, or urgency.
The delay is also variable. A change submitted on Monday might be approved Thursday. A change
submitted on Friday waits until the following Thursday. If the board requests revisions, add
another week. Developers cannot predict when their change will reach production because the
timeline depends on a meeting schedule and a queue they do not control.
This unpredictability makes it impossible to make reliable commitments. When a stakeholder asks
“when will this be live?” the developer must account for development time plus an unpredictable
CAB delay. The answer becomes “sometime in the next one to three weeks” for a change that took
two hours to build.
It creates a false sense of security
The most dangerous effect of the CAB is the belief that it prevents incidents. It does not. The
board reviews paperwork, not running systems. A well-written change request for a dangerous
change will be approved. A poorly written request for a safe change will be questioned. The
correlation between CAB approval and deployment safety is weak at best.
Studies of high-performing delivery organizations consistently show that external change approval
processes do not reduce failure rates. The 2019 Accelerate State of DevOps Report found that
teams with external change approval had higher failure rates than teams using peer review and
automated checks. The CAB provides a feeling of control without the substance.
This false sense of security is harmful because it displaces investment in controls that
actually work. If the organization believes the CAB prevents incidents, there is less pressure
to invest in automated testing, deployment verification, and progressive rollout - the controls
that actually reduce deployment risk.
Impact on continuous delivery
Continuous delivery requires that any change can reach production quickly through an automated
pipeline. A weekly approval meeting is fundamentally incompatible with continuous deployment.
The math is simple. If the CAB meets weekly and reviews 20 changes per meeting, the maximum deployment frequency is 20 per week. A team practicing CD might deploy 20 times per day, roughly 100 per working week. The CAB caps throughput at a fraction of what the team could otherwise deliver, before any other constraint even applies.
More importantly, the CAB process assumes that human review of change requests is a meaningful
quality gate. CD assumes that automated checks - tests, security scans, deployment verification -
are better quality gates because they are faster, more consistent, and more thorough. These are
incompatible philosophies. A team practicing CD replaces the CAB with pipeline-embedded controls
that provide equivalent (or superior) risk management without the delay.
How to Fix It
Eliminating the CAB outright is rarely possible because it exists to satisfy regulatory or
organizational governance requirements. The path forward is to replace the manual ceremony with
automated controls that satisfy the same requirements faster and more reliably.
Step 1: Classify changes by risk (Week 1)
Not all changes carry the same risk. Introduce a risk classification:
| Risk level | Criteria | Example | Approval process |
| --- | --- | --- | --- |
| Standard | Small, well-tested, automated rollback | Config change, minor bug fix, dependency update | Peer review + passing pipeline = auto-approved |
| Normal | Medium scope, well-tested | New feature behind a feature flag, API endpoint addition | Peer review + passing pipeline + team lead sign-off |
| High | Large scope, architectural, or compliance-sensitive | Database migration, authentication change, PCI-scoped change | Peer review + passing pipeline + architecture review |
The goal is to route 80-90% of changes through the standard process, which requires no CAB
involvement at all.
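The classification can be encoded so that the pipeline, not a meeting, decides which approval path a change takes. A minimal sketch; the field names and thresholds are illustrative, not a standard:

```python
# Route a change to an approval path based on declared risk attributes (sketch).
from dataclasses import dataclass

@dataclass
class Change:
    lines_changed: int
    has_automated_rollback: bool
    touches_schema_or_auth: bool

def classify(change: Change) -> str:
    if change.touches_schema_or_auth:
        return "high"      # peer review + pipeline + architecture review
    if change.lines_changed <= 200 and change.has_automated_rollback:
        return "standard"  # peer review + passing pipeline = auto-approved
    return "normal"        # peer review + pipeline + team lead sign-off
```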
Step 2: Define pipeline controls that replace CAB review (Weeks 2-3)
For each concern the CAB currently addresses, implement an automated alternative:
| CAB concern | Automated replacement |
| --- | --- |
| “Will this change break something?” | Automated test suite with high coverage, pipeline-gated |
| “Is there a rollback plan?” | Automated rollback built into the deployment pipeline |
| “Has this been tested?” | Test results attached to every change as pipeline evidence |
| “Is this change authorized?” | Peer code review with approval recorded in version control |
| “Do we have an audit trail?” | Pipeline logs capture who changed what, when, with what test results |
Document these controls. They become the evidence that satisfies auditors in place of the CAB
meeting minutes.
Step 3: Pilot auto-approval for standard changes (Week 3)
Pick one team or one service as a pilot. Standard-risk changes from that team bypass the CAB
entirely if they meet the automated criteria:
- Code review approved by at least one peer.
- All pipeline stages passed (build, test, security scan).
- Change classified as standard risk.
- Deployment includes automated health checks and rollback capability.
Track the results: deployment frequency, change fail rate, and incident count. Compare with the
CAB-gated process.
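The gate itself can be a short pipeline script that checks the criteria above and records the decision. A sketch with hypothetical inputs; in practice the values would come from your CI system and code host:

```python
# Auto-approval gate for standard-risk changes (sketch).
# Inputs would come from your CI system and code host; here they are parameters.
def auto_approved(risk_level: str,
                  peer_approvals: int,
                  pipeline_stages_passed: bool,
                  has_health_checks_and_rollback: bool) -> bool:
    return (risk_level == "standard"
            and peer_approvals >= 1
            and pipeline_stages_passed
            and has_health_checks_and_rollback)

if __name__ == "__main__":
    # A standard change with one peer review and a green pipeline deploys
    # without CAB involvement; anything else falls back to the manual path.
    print(auto_approved("standard", 1, True, True))   # True
    print(auto_approved("high", 2, True, True))       # False -> route to review board
```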
Step 4: Present the data and expand (Weeks 4-8)
After a month of pilot data, present the results to the CAB and organizational leadership:
- How many changes were auto-approved?
- What was the change fail rate for auto-approved changes vs. CAB-reviewed changes?
- How much faster did auto-approved changes reach production?
- How many incidents were caused by auto-approved changes?
If the data shows that auto-approved changes are as safe or safer than CAB-reviewed changes
(which is the typical outcome), expand the auto-approval process to more teams and more change
types.
Step 5: Reduce the CAB to high-risk changes only (Week 8+)
With most changes flowing through automated approval, the CAB’s scope shrinks to genuinely
high-risk changes: major architectural shifts, compliance-sensitive changes, and cross-team
infrastructure modifications. These changes are infrequent enough that a review process is not
a bottleneck.
The CAB meeting frequency drops from weekly to as-needed. The board members spend their time on
changes that actually benefit from human review rather than rubber-stamping routine deployments.
| Objection | Response |
| --- | --- |
| “The CAB is required by our compliance framework” | Most compliance frameworks (SOX, PCI, HIPAA) require separation of duties and change control, not a specific meeting. Automated pipeline controls with audit trails satisfy the same requirements. Engage your auditors early to confirm. |
| “Without the CAB, anyone could deploy anything” | The pipeline controls are stricter than the CAB. The CAB reviews a form for five minutes. The pipeline runs thousands of tests, security scans, and verification checks. Auto-approval is not no-approval - it is better approval. |
| “We’ve always done it this way” | The CAB was designed for a world of monthly releases. In that world, reviewing 10 changes per month made sense. In a CD world with 10 changes per day, the same process becomes a bottleneck that adds risk instead of reducing it. |
| “What if an auto-approved change causes an incident?” | What if a CAB-approved change causes an incident? (They do.) The question is not whether incidents happen but how quickly you detect and recover. Automated deployment verification and rollback detect and recover faster than any manual process. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Lead time | Should decrease as CAB delay is removed for standard changes |
| Release frequency | Should increase as deployment is no longer gated on weekly meetings |
| Change fail rate | Should remain stable or decrease - proving auto-approval is safe |
| Percentage of changes auto-approved | Should climb toward 80-90% |
| CAB meeting frequency | Should decrease from weekly to as-needed |
| Time from “ready to deploy” to “deployed” | Should drop from days to hours or minutes |
5.2 - Pressure to Skip Testing
Management pressures developers to skip or shortcut testing to meet deadlines. The test suite rots sprint by sprint as skipped tests become the norm.
Category: Organizational & Cultural | Quality Impact: High
What This Looks Like
A deadline is approaching. The manager asks the team how things are going. A developer says the
feature is done but the tests still need to be written. The manager says “we’ll come back to the
tests after the release.” The tests are never written. Next sprint, the same thing happens. After
a few months, the team has a codebase with patches of coverage surrounded by growing deserts of
untested code.
Nobody made a deliberate decision to abandon testing. It happened one shortcut at a time, each
one justified by a deadline that felt more urgent than the test suite.
Common variations:
- “Tests are a nice-to-have.” The team treats test writing as optional scope that gets cut
when time is short. Features are estimated without testing time. Tests are a separate backlog
item that never reaches the top.
- “We’ll add tests in the hardening sprint.” Testing is deferred to a future sprint dedicated
to quality. That sprint gets postponed, shortened, or filled with the next round of urgent
features. The testing debt compounds.
- “Just get it out the door.” A manager or product owner explicitly tells developers to skip
tests for a specific release. The implicit message is that shipping matters and quality does
not. Developers who push back are seen as slow or uncooperative.
- The coverage ratchet in reverse. The team once had 70% test coverage. Each sprint, a few
untested changes slip through. Coverage drops to 60%, then 50%, then 40%. Nobody notices the
trend because each individual drop is small. By the time someone looks at the number, half the
safety net is gone.
- Testing theater. Developers write the minimum tests needed to pass a coverage gate - trivial
assertions, tests that verify getters and setters, tests that do not actually exercise
meaningful behavior. The coverage number looks healthy but the tests catch nothing.
The telltale sign: the team has a backlog of “write tests for X” tickets that are months old and
have never been started, while production incidents keep increasing.
Why This Is a Problem
Skipping tests feels like it saves time in the moment. It does not. It borrows time from the
future at a steep interest rate. The effects are invisible at first and catastrophic later.
It reduces quality
Every untested change is a change that nobody can verify automatically. The first few skipped
tests are low risk - the code is fresh in the developer’s mind and unlikely to break. But as
weeks pass, the untested code is modified by other developers who do not know the original intent.
Without tests to pin the behavior, regressions creep in undetected.
The damage accelerates. When half the codebase is untested, developers cannot tell which changes
are safe and which are risky. They treat every change as potentially dangerous, which slows them
down. Or they treat every change as probably fine, which lets bugs through. Either way, quality
suffers.
Teams that maintain their test suite catch regressions within minutes of introducing them. The
developer who caused the regression fixes it immediately because they are still working on the
relevant code. The cost of the fix is minutes, not days.
It increases rework
Untested code generates rework in two forms. First, bugs that would have been caught by tests
reach production and must be investigated, diagnosed, and fixed under pressure. A bug found by a
test costs minutes to fix. The same bug found in production costs hours - plus the cost of
the incident response, the rollback or hotfix, and the customer impact.
Second, developers working in untested areas of the codebase move slowly because they have no
safety net. They make a change, manually verify it, discover it broke something else, revert,
try again. Work that should take an hour takes a day because every change requires manual
verification.
The rework is invisible in sprint metrics. The team does not track “time spent debugging issues
that tests would have caught.” But it shows up in velocity: the team ships less and less each
sprint even as they work longer hours.
It makes delivery timelines unpredictable
When the test suite is healthy, the time from “code complete” to “deployed” is a known quantity.
The pipeline runs, tests pass, the change ships. When the test suite has been hollowed out by
months of skipped tests, that step becomes unpredictable. Some changes pass cleanly. Others
trigger production incidents that take days to resolve.
The manager who pressured the team to skip tests in order to hit a deadline ends up with less
predictable timelines, not more. Each skipped test is a small increase in the probability that a
future change will cause an unexpected failure. Over months, the cumulative probability climbs
until production incidents become a regular occurrence rather than an exception.
Teams with comprehensive test suites deliver predictably because the automated checks eliminate
the largest source of variance - undetected defects.
It creates a death spiral
The most dangerous aspect of this anti-pattern is that it is self-reinforcing. Skipping tests
leads to more bugs. More bugs lead to more time spent firefighting. More time firefighting means
less time for testing. Less testing means more bugs. The cycle accelerates.
At the same time, the codebase becomes harder to test. Code written without tests in mind tends
to be tightly coupled, dependent on global state, and difficult to isolate. The longer testing is
deferred, the more expensive it becomes to add tests later. The team’s estimate for “catching up
on testing” grows from days to weeks to months, making it even less likely that management will
allocate the time.
Eventually, the team reaches a state where the test suite is so degraded that it provides no confidence. The team is effectively back to having no test automation, but with the added burden of maintaining a broken test infrastructure that nobody trusts.
Impact on continuous delivery
Continuous delivery requires automated quality gates that the team can rely on. A test suite that
has been eroded by months of skipped tests is not a quality gate - it is a gate with widening
holes. Changes pass through it not because they are safe but because the tests that would have
caught the problems were never written.
A team cannot deploy continuously if they cannot verify continuously. When the manager says “skip
the tests, we need to ship,” they are not just deferring quality work. They are dismantling the
infrastructure that makes frequent, safe deployment possible.
How to Fix It
Step 1: Make the cost visible (Week 1)
The pressure to skip tests comes from a belief that testing is overhead rather than investment.
Change that belief with data:
- Count production incidents in the last 90 days. For each one, identify whether an automated
test could have caught it. Calculate the total hours spent on incident response.
- Measure the team’s change fail rate - the percentage of deployments that cause a failure or
require a rollback.
- Track how long manual verification takes per release. Sum the hours across the team.
Present these numbers to the manager applying pressure. Frame it concretely: “We spent 40 hours on incident response last quarter. Thirty of those hours went to incidents that the tests we skipped would have caught.”
Step 2: Include testing in every estimate (Week 2)
Stop treating tests as separate work items that can be deferred:
- Agree as a team: no story is “done” until it has automated tests. This is a working agreement,
not a suggestion.
- Include testing time in every estimate. If a feature takes three days to build, the estimate is
three days - including tests. Testing is not additive; it is part of building the feature.
- Stop creating separate “write tests” tickets. Tests are part of the story, not a follow-up
task.
When a manager asks “can we skip the tests to ship faster?” the answer is “the tests are part of
shipping. Skipping them means the feature is not done.”
Step 3: Set a coverage floor and enforce it (Week 3)
Prevent further erosion with an automated guardrail:
- Measure current test coverage. Whatever it is - 30%, 50%, 70% - that is the floor.
- Configure the pipeline to fail if a change reduces coverage below the floor.
- Ratchet the floor up by 1-2 percentage points each month.
The floor makes the cost of skipping tests immediate and visible. A developer who skips tests
will see the pipeline fail. The conversation shifts from “we’ll add tests later” to “the pipeline
won’t let us merge without tests.”
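Most coverage tools can enforce the floor directly in the pipeline. A minimal sketch for a Python codebase, assuming coverage.py has already produced its JSON report with `coverage json`; the 60% floor is an example value:

```python
# Fail the pipeline if coverage drops below the agreed floor (sketch).
# Assumes coverage.py's JSON report (coverage.json) exists in the working directory.
import json
import sys

FLOOR = 60.0  # example floor; ratchet it up 1-2 points per month

with open("coverage.json") as fh:
    percent = json.load(fh)["totals"]["percent_covered"]

if percent < FLOOR:
    print(f"Coverage {percent:.1f}% is below the floor of {FLOOR}%. Add tests before merging.")
    sys.exit(1)
print(f"Coverage {percent:.1f}% meets the floor of {FLOOR}%.")
```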
Step 4: Recover coverage in high-risk areas (Weeks 3-6)
You cannot test everything retroactively. Prioritize the areas that matter most:
- Use version control history to find the files with the most changes and the most bug fixes.
These are the highest-risk areas.
- For each high-risk file, write tests for the core behavior - the functions that other code
depends on.
- Allocate a fixed percentage of each sprint (e.g., 20%) to writing tests for existing code.
This is not optional and not deferrable.
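A rough way to find the highest-risk files, as the first bullet above suggests, is to count how often each file appears in commits whose messages mention a fix. A sketch using plain git output; the six-month window and the "fix" keyword are heuristics, not rules:

```python
# Rank files by how often they change in bug-fix commits (sketch).
import subprocess
from collections import Counter

log = subprocess.run(
    ["git", "log", "--since=6 months ago", "--grep=fix", "-i",
     "--name-only", "--pretty=format:"],
    capture_output=True, text=True, check=True,
).stdout

counts = Counter(line for line in log.splitlines() if line.strip())
for path, n in counts.most_common(15):
    print(f"{n:4d}  {path}")
```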
Step 5: Address the management pressure directly (Ongoing)
The root cause is a manager who sees testing as optional. This requires a direct conversation:
| What the manager says | What to say back |
| --- | --- |
| “We don’t have time for tests” | “We don’t have time for the production incidents that skipping tests causes. Last quarter, incidents cost us X hours.” |
| “Just this once, we’ll catch up later” | “We said that three sprints ago. Coverage has dropped from 60% to 45%. There is no ’later’ unless we stop the bleeding now.” |
| “The customer needs this feature by Friday” | “The customer also needs the application to work. Shipping an untested feature on Friday and a hotfix on Monday does not save time.” |
| “Other teams ship without this many tests” | “Other teams with similar practices have a change fail rate of X%. Ours is Y%. The tests are why.” |
If the manager continues to apply pressure after seeing the data, escalate. Test suite erosion is
a technical risk that affects the entire organization’s ability to deliver. It is appropriate to
raise it with engineering leadership.
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Test coverage trend | Should stop declining and begin climbing |
| Change fail rate | Should decrease as coverage recovers |
| Production incidents from untested code | Track root causes - “no test coverage” should become less frequent |
| Stories completed without tests | Should drop to zero |
| Development cycle time | Should stabilize as manual verification decreases |
| Sprint capacity spent on incident response | Should decrease as fewer untested changes reach production |
6 - Monitoring and Observability
Anti-patterns in monitoring, alerting, and observability that block continuous delivery.
These anti-patterns affect the team’s ability to see what is happening in production. They
create blind spots that make deployment risky, incident response slow, and confidence in
the delivery pipeline impossible to build.
6.1 - No Observability
The team cannot tell if a deployment is healthy. No metrics, no log aggregation, no tracing. Issues are discovered when customers call support.
Category: Monitoring & Observability | Quality Impact: High
What This Looks Like
The team deploys a change. Someone asks “is it working?” Nobody knows. There is no dashboard to
check. There are no metrics to compare before and after. The team waits. If nobody complains
within an hour, they assume the deployment was successful.
When something does go wrong, the team finds out from a customer support ticket, a Slack message
from another team, or an executive asking why the site is slow. The investigation starts with
SSH-ing into a server and reading raw log files. Hours pass before anyone understands what
happened, what caused it, or how many users were affected.
Common variations:
- Logs exist but are not aggregated. Each server writes its own log files. Debugging requires
logging into multiple servers and running grep. Correlating a request across services means
opening terminals to five machines and searching by timestamp.
- Metrics exist but nobody watches them. A monitoring tool was set up once. It has default
dashboards for CPU and memory. Nobody configured application-level metrics. The dashboards show
that servers are running, not whether the application is working.
- Alerting is all or nothing. Either there are no alerts, or there are hundreds of noisy
alerts that the team ignores. Real problems are indistinguishable from false alarms. The
on-call person mutes their phone.
- Observability is someone else’s job. A separate operations or platform team owns the
monitoring tools. The development team does not have access, does not know what is monitored,
and does not add instrumentation to their code.
- Post-deployment verification is manual. After every deployment, someone clicks through the
application to check if it works. This takes 15 minutes per deployment. It catches obvious
failures but misses performance degradation, error rate increases, and partial outages.
The telltale sign: the team’s primary method for detecting production problems is waiting for
someone outside the team to report them.
Why This Is a Problem
Without observability, the team is deploying into a void. They cannot verify that deployments
are healthy, cannot detect problems quickly, and cannot diagnose issues when they arise. Every
deployment is a bet that nothing will go wrong, with no way to check.
It reduces quality
When the team cannot see the effects of their changes in production, they cannot learn from them.
A deployment that degrades response times by 200 milliseconds goes unnoticed. A change that
causes a 2% increase in error rates is invisible. These small quality regressions accumulate
because nobody can see them.
Without production telemetry, the team also loses the most valuable feedback loop: how the
software actually behaves under real load with real data. A test suite can verify logic, but only
production observability reveals performance characteristics, usage patterns, and failure modes
that tests cannot simulate.
Teams with strong observability catch regressions within minutes of deployment. They see error
rate spikes, latency increases, and anomalous behavior in real time. They roll back or fix the
issue before most users are affected. Quality improves because the feedback loop from deployment
to detection is minutes, not days.
It increases rework
Without observability, incidents take longer to detect, longer to diagnose, and longer to resolve.
Each phase of the incident lifecycle is extended because the team is working blind.
Detection takes hours or days instead of minutes because the team relies on external reports.
Diagnosis takes hours instead of minutes because there are no traces, no correlated logs, and no
metrics to narrow the search. The team resorts to reading code and guessing. Resolution takes
longer because without metrics, the team cannot verify that their fix actually worked - they
deploy the fix and wait to see if the complaints stop.
A team with observability detects problems in minutes through automated alerts, diagnoses them
in minutes by following traces and examining metrics, and verifies fixes instantly by watching
dashboards. The total incident lifecycle drops from hours to minutes.
It makes delivery timelines unpredictable
Without observability, the team cannot assess deployment risk. They do not know the current error
rate, the baseline response time, or the system’s capacity. Every deployment might trigger an
incident that consumes the rest of the day, or it might go smoothly. The team cannot predict
which.
This uncertainty makes the team cautious. They deploy less frequently because each deployment is
a potential fire. They avoid deploying on Fridays, before holidays, or before important events.
They batch up changes so there are fewer risky deployment moments. Each of these behaviors slows
delivery and increases batch size, which increases risk further.
Teams with observability deploy with confidence because they can verify health immediately. A
deployment that causes a problem is detected and rolled back in minutes. The blast radius is
small because the team catches issues before they spread. This confidence enables frequent
deployment, which keeps batch sizes small, which reduces risk.
Impact on continuous delivery
Continuous delivery requires fast feedback from production. The deploy-and-verify cycle must be
fast enough that the team can deploy many times per day with confidence. Without observability,
there is no verification step - only hope.
Specifically, CD requires:
- Automated deployment verification. After every deployment, the pipeline must verify that the
new version is healthy before routing traffic to it. This requires health checks, metric
comparisons, and automated rollback triggers - all of which require observability.
- Fast incident detection. If a deployment causes a problem, the team must know within
minutes, not hours. Automated alerts based on error rates, latency, and business metrics
are essential.
- Confident rollback decisions. When a deployment looks unhealthy, the team must be able to
compare current metrics to the baseline and make a data-driven rollback decision. Without
metrics, rollback decisions are based on gut feeling and anecdote.
A team without observability can automate deployment, but they cannot automate verification. That
means every deployment requires manual checking, which caps deployment frequency at whatever pace
the team can manually verify.
How to Fix It
Step 1: Add structured logging (Week 1)
Structured logging is the foundation of observability. Without it, logs are unreadable at scale.
- Replace unstructured log statements (`log("processing order")`) with structured ones (`log(event="order.processed", order_id=123, duration_ms=45)`).
- Include a correlation ID in every log entry so that all log entries for a single request can
be linked together across services.
- Send logs to a central aggregation service (Elasticsearch, Datadog, CloudWatch, Loki, or
similar). Stop relying on SSH and grep.
Focus on the most critical code paths first: request handling, error paths, and external service
calls. You do not need to instrument everything in week one.
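A minimal sketch of structured, correlated logging using only the Python standard library; a logging library such as structlog or your platform's SDK gives the same result with less plumbing. The event names mirror the examples above and are placeholders:

```python
# Structured logging with a correlation ID, standard library only (sketch).
import json
import logging
import time
import uuid

logger = logging.getLogger("app")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(event: str, correlation_id: str, **fields) -> None:
    # One JSON object per line is easy for any log aggregator to ingest.
    logger.info(json.dumps({"event": event, "correlation_id": correlation_id,
                            "ts": time.time(), **fields}))

def handle_order(order_id: int) -> None:
    correlation_id = str(uuid.uuid4())   # generated at the edge, passed to downstream calls
    start = time.time()
    log_event("order.received", correlation_id, order_id=order_id)
    # ... business logic ...
    log_event("order.processed", correlation_id, order_id=order_id,
              duration_ms=int((time.time() - start) * 1000))
```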
Step 2: Add application-level metrics (Week 2)
Infrastructure metrics (CPU, memory, disk) tell you the servers are running. Application metrics
tell you the software is working. Add the four golden signals:
| Signal | What to measure | Example |
| --- | --- | --- |
| Latency | How long requests take | p50, p95, p99 response time per endpoint |
| Traffic | How much demand the system handles | Requests per second, messages processed per minute |
| Errors | How often requests fail | Error rate by endpoint, HTTP 5xx count |
| Saturation | How full the system is | Queue depth, connection pool usage, thread count |
Expose these metrics through your application (using Prometheus client libraries, StatsD, or
your platform’s metric SDK) and visualize them on a dashboard.
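A sketch of the latency, traffic, and error signals using the Prometheus Python client; the metric names, labels, and placeholder handler are illustrative. Saturation usually comes from the runtime or platform rather than hand-written instrumentation.

```python
# Latency, traffic, and errors with the Prometheus Python client (sketch).
from prometheus_client import Counter, Histogram, start_http_server
import random
import time

REQUESTS = Counter("http_requests_total", "Requests handled", ["endpoint", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["endpoint"])

def handle_request(endpoint: str) -> None:
    start = time.time()
    status = "500" if random.random() < 0.01 else "200"  # placeholder for real handling
    LATENCY.labels(endpoint).observe(time.time() - start)
    REQUESTS.labels(endpoint, status).inc()

if __name__ == "__main__":
    start_http_server(8000)   # Prometheus scrapes http://localhost:8000/metrics
    while True:
        handle_request("/checkout")
        time.sleep(0.1)
```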
Step 3: Create a deployment health dashboard (Week 3)
Build a single dashboard that answers: “Is the system healthy right now?”
- Include the four golden signals from Step 2.
- Add deployment markers so the team can see when deploys happened and correlate them with
metric changes.
- Include business metrics that matter: successful checkouts per minute, sign-ups per hour,
or whatever your system’s key transactions are.
This dashboard becomes the first thing the team checks after every deployment. It replaces the
manual click-through verification.
Step 4: Add automated alerts for deployment verification (Week 4)
Move from “someone checks the dashboard” to “the system tells us when something is wrong”:
- Set alert thresholds based on your baseline metrics. If the p95 latency is normally 200ms,
alert when it exceeds 500ms for more than 2 minutes.
- Set error rate alerts. If the error rate is normally below 1%, alert when it crosses 5%.
- Connect alerts to the team’s communication channel (Slack, PagerDuty, or similar). Alerts
must reach the people who can act on them.
Start with a small number of high-confidence alerts. Three alerts that fire reliably are worth
more than thirty that the team ignores.
Step 5: Integrate observability into the deployment pipeline (Week 5+)
Close the loop between deployment and verification:
- After deploying, the pipeline waits and checks health metrics automatically. If error rates
spike or latency degrades beyond the threshold, the pipeline triggers an automatic rollback.
- Add smoke tests that run against the live deployment and report results to the dashboard.
- Implement canary deployments or progressive rollouts that route a small percentage of traffic
to the new version and compare its metrics against the baseline before promoting.
This is the point where observability enables continuous delivery. The pipeline can deploy with
confidence because it can verify health automatically.
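The post-deployment check can start as a script the pipeline runs after each deploy: query the metrics backend, compare against a threshold, and trigger a rollback if it is breached. A sketch assuming a Prometheus-compatible query API and a hypothetical `rollback.sh` script; the address, query, and threshold are placeholders:

```python
# Post-deployment verification gate (sketch).
import json
import subprocess
import sys
import time
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus.internal:9090/api/v1/query"   # placeholder address
ERROR_RATE_QUERY = ('sum(rate(http_requests_total{status=~"5.."}[5m])) '
                    '/ sum(rate(http_requests_total[5m]))')
THRESHOLD = 0.05   # example: roll back above 5% errors

def error_rate() -> float:
    url = PROM_URL + "?" + urllib.parse.urlencode({"query": ERROR_RATE_QUERY})
    with urllib.request.urlopen(url) as resp:
        result = json.load(resp)["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    time.sleep(120)  # let the new version take traffic before judging it
    rate = error_rate()
    if rate > THRESHOLD:
        print(f"Error rate {rate:.2%} exceeds {THRESHOLD:.0%}; rolling back.")
        subprocess.run(["./rollback.sh"], check=True)   # hypothetical rollback script
        sys.exit(1)
    print(f"Error rate {rate:.2%} within threshold; deployment verified.")
```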
| Objection | Response |
| --- | --- |
| “We don’t have budget for monitoring tools” | Open-source stacks (Prometheus, Grafana, Loki, Jaeger) provide full observability at zero license cost. The investment is setup time, not money. |
| “We don’t have time to add instrumentation” | Start with the deployment health dashboard. One afternoon of work gives the team more production visibility than they have ever had. Build from there. |
| “The ops team handles monitoring” | Observability is a development concern, not just an operations concern. Developers write the code that generates the telemetry. They need access to the dashboards and alerts. |
| “We’ll add observability after we stabilize” | You cannot stabilize what you cannot see. Observability is how you find stability problems. Adding it later means flying blind longer. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Mean time to detect (MTTD) | Time from problem occurring to team being aware - should drop from hours to minutes |
| Mean time to repair | Should decrease as diagnosis becomes faster |
| Manual verification time per deployment | Should drop to zero as automated checks replace manual click-throughs |
| Change fail rate | Should decrease as deployment verification catches problems before they reach users |
| Alert noise ratio | Percentage of alerts that are actionable - should be above 80% |
| Incidents discovered by customers vs. by the team | Ratio should shift toward team detection |
7 - Architecture
Anti-patterns in system architecture and design that block continuous delivery.
These anti-patterns affect the structure of the software itself. They create coupling that
makes independent deployment impossible, blast radii that make every change risky, and
boundaries that force teams to coordinate instead of delivering independently.
7.1 - Tightly Coupled Monolith
Changing one module breaks others. No clear boundaries. Every change is high-risk because blast radius is unpredictable.
Category: Architecture | Quality Impact: High
What This Looks Like
A developer changes a function in the order processing module. The test suite fails in the
reporting module, the notification service, and a batch job that nobody knew existed. The
developer did not touch any of those systems. They changed one function in one file, and three
unrelated features broke.
The team has learned to be cautious. Before making any change, developers trace every caller,
every import, and every database query that might be affected. A change that should take an hour
takes a day because most of the time is spent figuring out what might break. Even after that
analysis, surprises are common.
Common variations:
- The web of shared state. Multiple modules read and write the same database tables directly.
A schema change in one module breaks queries in five others. Nobody owns the tables because
everybody uses them.
- The god object. A single class or module that everything depends on. It handles
authentication, logging, database access, and business logic. Changing it is terrifying because
the entire application runs through it.
- Transitive dependency chains. Module A depends on Module B, which depends on Module C. A
change to Module C breaks Module A through a chain that nobody can trace without a debugger.
The dependency graph is a tangle, not a tree.
- Shared libraries with hidden contracts. Internal libraries used by multiple modules with no
versioning or API stability guarantees. Updating the library for one consumer breaks another.
Teams stop updating shared libraries because the risk is too high.
- Everything deploys together. The application is a single deployable unit. Even if modules
are logically separated in the source code, they compile and ship as one artifact. A one-line
change to the login page requires deploying the entire system.
The telltale sign: developers regularly say “I don’t know what this change will affect” and
mean it. Changes routinely break features that seem unrelated.
Why This Is a Problem
Tight coupling turns every change into a gamble. The cost of a change is not proportional to its
size but to the number of hidden dependencies it touches. Small changes carry large risk, which
slows everything down.
It reduces quality
When every change can break anything, developers cannot reason about the impact of their work.
A well-bounded module lets a developer think locally: “I changed the discount calculation, so
discount-related behavior might be affected.” A tightly coupled system offers no such guarantee.
The discount calculation might share a database table with the shipping module, which triggers
a notification workflow, which updates a dashboard.
This unpredictable blast radius makes code review less effective. Reviewers can verify that the
code in the diff is correct, but they cannot verify that it is safe. The breakage happens in code
that is not in the diff - code that neither the author nor the reviewer thought to check.
In a system with clear module boundaries, the blast radius of a change is bounded by the module’s
interface. If the interface does not change, nothing outside the module can break. Developers and
reviewers can focus on the module itself and trust the boundary.
It increases rework
Tight coupling causes rework in two ways. First, unexpected breakage from seemingly safe changes
sends developers back to fix things they did not intend to touch. A one-line change that breaks
the notification system means the developer now needs to understand and fix the notification
system before their original change can ship.
Second, developers working in different parts of the codebase step on each other. Two developers
changing different modules unknowingly modify the same shared state. Both changes work
individually but conflict when merged. The merge succeeds at the code level but fails at runtime
because the shared state cannot satisfy both changes simultaneously. These bugs are expensive to
find because the failure only manifests when both changes are present.
Systems with clear boundaries minimize this interference. Each module owns its data and exposes
it through explicit interfaces. Two developers working in different modules cannot create a
hidden conflict because there is no shared mutable state to conflict on.
It makes delivery timelines unpredictable
In a coupled system, the time to deliver a change includes the time to understand the impact,
make the change, fix the unexpected breakage, and retest everything that might be affected. The
first and third steps are unpredictable because no one knows the full dependency graph.
A developer estimates a task at two days. On day one, the change is made and tests are passing.
On day two, a failing test in another module reveals a hidden dependency. Fixing the dependency
takes two more days. The task that was estimated at two days takes four. This happens often enough
that the team stops trusting estimates, and stakeholders stop trusting timelines.
The testing cost is also unpredictable. In a modular system, changing Module A means running
Module A’s tests. In a coupled system, changing anything might mean running everything. If the
full test suite takes 30 minutes, every small change requires a 30-minute feedback cycle because
there is no way to scope the impact.
It prevents independent team ownership
When the codebase is a tangle of dependencies, no team can own a module cleanly. Every change in
one team’s area risks breaking another team’s area. Teams develop informal coordination rituals:
“Let us know before you change the order table.” “Don’t touch the shared utils module without
talking to Platform first.”
These coordination costs scale quadratically with the number of teams. Two teams need one
communication channel. Five teams need ten. Ten teams need forty-five. The result is that adding
developers makes the system slower to change, not faster.
In a system with well-defined module boundaries, each team owns their modules and their data.
They deploy independently. They do not need to coordinate on internal changes because the
boundaries prevent cross-module breakage. Communication focuses on interface changes, which are
infrequent and explicit.
Impact on continuous delivery
Continuous delivery requires that any change can flow from commit to production safely and
quickly. Tight coupling breaks this in multiple ways:
- Blast radius prevents small, safe changes. If a one-line change can break unrelated
features, no change is small from a risk perspective. The team compensates by batching changes
and testing extensively, which is the opposite of continuous.
- Testing scope is unbounded. Without module boundaries, there is no way to scope testing to
the changed area. Every change requires running the full suite, which slows the pipeline and
reduces deployment frequency.
- Independent deployment is impossible. If everything must deploy together, deployment
coordination is required. Teams queue up behind each other. Deployment frequency is limited by
the slowest team.
- Rollback is risky. Rolling back one change might break something else if other changes
were deployed simultaneously. The tangle works in both directions.
A team with a tightly coupled monolith can still practice CD, but they must invest in decoupling
first. Without boundaries, the feedback loops are too slow and the blast radius is too large for
continuous deployment to be safe.
How to Fix It
Decoupling a monolith is a long-term effort. The goal is not to rewrite the system or extract
microservices on day one. The goal is to create boundaries that limit blast radius and enable
independent change. Start where the pain is greatest.
Step 1: Map the dependency hotspots (Week 1)
Identify the areas of the codebase where coupling causes the most pain:
- Use version control history to find the files that change together most frequently. Files that
always change as a group are likely coupled.
- List the modules or components that are most often involved in unexpected test failures after
changes to other areas.
- Identify shared database tables - tables that are read or written by more than one module.
- Draw the dependency graph. Tools like dependency-cruiser (JavaScript), jdepend (Java), or
similar can automate this. Look for cycles and high fan-in nodes.
Rank the hotspots by pain: which coupling causes the most unexpected breakage, the most
coordination overhead, or the most test failures?
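The co-change analysis in the first bullet above needs nothing more than git output: count how often pairs of files appear in the same commit. A rough sketch; the six-month window and the commit-size cutoff are heuristics:

```python
# Find files that frequently change together (coupling hotspots) - sketch.
import subprocess
from collections import Counter
from itertools import combinations

log = subprocess.run(
    ["git", "log", "--since=6 months ago", "--name-only", "--pretty=format:--commit--"],
    capture_output=True, text=True, check=True,
).stdout

pairs = Counter()
for commit in log.split("--commit--"):
    files = sorted({line.strip() for line in commit.splitlines() if line.strip()})
    if 1 < len(files) <= 30:               # skip empty and very large commits
        pairs.update(combinations(files, 2))

for (a, b), n in pairs.most_common(15):
    print(f"{n:4d}  {a}  <->  {b}")
```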
Step 2: Define module boundaries on paper (Week 2)
Before changing any code, define where boundaries should be:
- Group related functionality into candidate modules based on business domain, not technical
layer. “Orders,” “Payments,” and “Notifications” are better boundaries than “Database,”
“API,” and “UI.”
- For each boundary, define what the public interface would be: what data crosses the boundary
and in what format?
- Identify shared state that would need to be split or accessed through interfaces.
This is a design exercise, not an implementation. The output is a diagram showing target module
boundaries with their interfaces.
Step 3: Enforce one boundary (Weeks 3-6)
Pick the boundary with the best ratio of pain-reduced to effort-required and enforce it in code:
- Create an explicit interface (API, function contract, or event) for cross-module communication.
All external callers must use the interface.
- Move shared database access behind the interface. If the payments module needs order data, it
calls the orders module’s interface rather than querying the orders table directly.
- Add a build-time or lint-time check that enforces the boundary. Fail the build if code outside
the module imports internal code directly.
This is the hardest step because it requires changing existing call sites. Use the Strangler Fig
approach: create the new interface alongside the old coupling, migrate callers one at a time, and
remove the old path when all callers have migrated.
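The build-time check does not need a special tool to start; a short script that scans for forbidden imports is enough to make the boundary real. A sketch for a Python codebase with a hypothetical `orders` module and `src/` layout; dependency-cruiser and ArchUnit-style tools play the same role for JavaScript and Java.

```python
# Fail the build when code outside a module imports its internals (sketch).
# The module name and source layout are placeholders for your own structure.
import pathlib
import re
import sys

MODULE = "orders"                       # the bounded module
INTERNAL = re.compile(rf"^\s*(from|import)\s+{MODULE}\.internal\b")

violations = []
for path in pathlib.Path("src").rglob("*.py"):
    if path.parts[:2] == ("src", MODULE):
        continue                        # the module may use its own internals
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        if INTERNAL.match(line):
            violations.append(f"{path}:{lineno}: imports {MODULE}.internal directly")

if violations:
    print("\n".join(violations))
    print(f"\nUse the public interface of '{MODULE}' instead of its internals.")
    sys.exit(1)
```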
Step 4: Scope testing to module boundaries (Week 4+)
Once a boundary exists, use it to scope testing:
- Write tests for the module’s public interface (contract tests and functional tests).
- Changes within the module only need to run the module’s own tests plus the interface tests.
If the interface tests pass, nothing outside the module can break.
- Reserve the full integration suite for deployment validation, not developer feedback.
This immediately reduces pipeline duration for changes inside the bounded module. Developers get
faster feedback. The pipeline is no longer “run everything for every change.”
Step 5: Repeat for the next boundary (Ongoing)
Each new boundary reduces blast radius, improves test scoping, and enables more independent
ownership. Prioritize by pain:
| Signal | What it tells you |
| --- | --- |
| Files that always change together across modules | Coupling that forces coordinated changes |
| Unexpected test failures after unrelated changes | Hidden dependencies through shared state |
| Multiple teams needing to coordinate on changes | Ownership boundaries that do not match code boundaries |
| Long pipeline duration from running all tests | No way to scope testing because boundaries do not exist |
Over months, the system evolves from a tangle into a set of modules with defined interfaces. This
is not a rewrite. It is incremental boundary enforcement applied where it matters most.
| Objection | Response |
| --- | --- |
| “We should just rewrite it as microservices” | A rewrite takes months or years and delivers zero value until it is finished. Enforcing boundaries in the existing codebase delivers value with each boundary and does not require a big-bang migration. |
| “We don’t have time to refactor” | You are already paying the cost of coupling in unexpected breakage, slow testing, and coordination overhead. Each boundary you enforce reduces that ongoing cost. |
| “The coupling is too deep to untangle” | Start with the easiest boundary, not the hardest. Even one well-enforced boundary reduces blast radius and proves the approach works. |
| “Module boundaries will slow us down” | Boundaries add a small cost to cross-module changes and remove a large cost from within-module changes. Since most changes are within a module, the net effect is faster delivery. |
Measuring Progress
| Metric | What to look for |
| --- | --- |
| Unexpected cross-module test failures | Should decrease as boundaries are enforced |
| Change fail rate | Should decrease as blast radius shrinks |
| Build duration | Should decrease as testing can be scoped to affected modules |
| Development cycle time | Should decrease as developers spend less time tracing dependencies |
| Cross-team coordination requests per sprint | Should decrease as module ownership becomes clearer |
| Files changed per commit | Should decrease as changes become more localized |