Deployment and Release Problems
Symptoms related to deployment frequency, release risk, coordination overhead, and environment parity.
These symptoms indicate problems with your deployment and release process. When deploying is
painful, teams deploy less often, which increases batch size and risk. Each page describes what
you are seeing and links to the anti-patterns most likely causing it.
How to use this section
Start with the symptom that matches what your team experiences. Each symptom page explains what
you are seeing, identifies the most likely root causes (anti-patterns), and provides diagnostic
questions to narrow down which cause applies to your situation. Follow the anti-pattern link to
find concrete fix steps.
Related anti-pattern categories: Pipeline Anti-Patterns, Architecture Anti-Patterns
Related guides: Pipeline Architecture, Rollback, Small Batches
1 - Multiple Services Must Be Deployed Together
Changes cannot go to production until multiple services are deployed in a specific order during a coordinated release window.
What you are seeing
A developer finishes a change to one service. It is tested, reviewed, and ready to deploy. But it
cannot go out alone. The change depends on a schema migration in a shared database, a new endpoint
in another service, and a UI update in a third. All three teams coordinate a release window.
Someone writes a deployment runbook with numbered steps. If step four fails, steps one through
three need to be rolled back manually.
The team cannot deploy on a Tuesday afternoon because the other teams are not ready. The change
sits in a branch (or merged to main but feature-flagged off) waiting for the coordinated release
next Thursday. By then, more changes have accumulated, making the release larger and riskier.
Common causes
Tightly Coupled Monolith
When services share a database, call each other without versioned contracts, or depend on
deployment order, they cannot be deployed independently. A change to Service A’s data model breaks
Service B if Service B is not updated at the same time. The architecture forces coordination
because the boundaries between services are not real boundaries. They are implementation details
that leak across service lines.
Read more: Tightly Coupled Monolith
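For contrast, here is a minimal sketch of what a tolerant, versioned contract between a consumer and a provider can look like, assuming a JSON-style payload. The names (OrderV1, parse_order, tax_cents) are invented for illustration and are not from this guide.

```python
# Hypothetical tolerant-reader sketch: the consumer reads only the fields it
# needs and defaults the rest, so the provider can deploy additions first.
from dataclasses import dataclass

@dataclass
class OrderV1:
    order_id: str
    total_cents: int
    currency: str = "USD"   # default lets the provider introduce the field later

def parse_order(payload: dict) -> OrderV1:
    """Ignore unknown fields; fall back to defaults for missing ones."""
    return OrderV1(
        order_id=payload["order_id"],
        total_cents=payload["total_cents"],
        currency=payload.get("currency", "USD"),
    )

# The provider ships a payload with an extra field ("tax_cents") before this
# consumer knows about it, and nothing breaks.
print(parse_order({"order_id": "42", "total_cents": 1999, "tax_cents": 160}))
```

When both sides read and write this way, a payload change no longer forces the two services to deploy in lockstep.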
Distributed Monolith
The organization moved from a monolith to services, but the service boundaries are wrong. Services
were decomposed along technical lines (a “database service,” an “auth service,” a “notification
service”) rather than along domain lines. The result is services that cannot handle a business
request on their own. Every user-facing operation requires a synchronous chain of calls across
multiple services. If one service in the chain is unavailable or deploying, the entire operation
fails.
This is a monolith distributed across the network. It has all the operational complexity of
microservices (network latency, partial failures, distributed debugging) with none of the
benefits (independent deployment, team autonomy, fault isolation). Deploying one service still
requires deploying the others because the boundaries do not correspond to independent units of
business functionality.
Read more: Distributed Monolith
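A quick back-of-the-envelope calculation shows why these synchronous chains hurt. The figures below are illustrative, not measurements from any particular system.

```python
# Availability of a synchronous call chain is the product of its parts, so a
# distributed monolith is less available than any single service inside it.
per_service_availability = 0.999      # each service is up 99.9% of the time
chain_length = 4                      # one business request touches 4 services
chain_availability = per_service_availability ** chain_length
print(f"chain availability: {chain_availability:.4%}")   # ~99.60%
```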
Horizontal Slicing
When work for a feature is decomposed by service (“Team A builds the API, Team B updates the UI,
Team C modifies the processor”), each team’s change is incomplete on its own. Nothing is
deployable until all teams finish their part. The decomposition created the coordination
requirement. Vertical slicing within each team’s domain, with stable contracts between services,
allows each team to deploy when their slice is ready.
Read more: Horizontal Slicing
Undone Work
Sometimes the coordination requirement is artificial. The service could technically be deployed
independently, but the team’s definition of done requires a cross-service integration test that
only runs during the release window. Or deployment is gated on a manual approval from another
team. The coordination is not forced by the architecture but by process decisions that bundle
independent changes into a single release event.
Read more: Undone Work
How to narrow it down
- Do services share a database or call each other without versioned contracts? If yes, the
architecture forces coordination. Changes to shared state or unversioned interfaces cannot be
deployed independently. Start with
Tightly Coupled Monolith.
- Does every user-facing request require a synchronous chain across multiple services? If a
single business operation touches three or more services in sequence, the service boundaries
were drawn in the wrong place. You have a distributed monolith. Start with
Distributed Monolith.
- Was the feature decomposed by service or team rather than by behavior? If each team built
their piece of the feature independently and now all pieces must go out together, the work was
sliced horizontally. Start with
Horizontal Slicing.
- Could each service technically be deployed on its own, but process or policy prevents it?
If the coupling is in the release process (shared release window, cross-team sign-off, manual
integration test gate) rather than in the code, the constraint is organizational. Start with
Undone Work and examine whether the definition
of done requires unnecessary coordination.
2 - The Team Is Afraid to Deploy
Production deployments cause anxiety because they frequently fail. The team delays deployments, which increases batch size, which increases risk.
What you are seeing
Nobody wants to deploy on a Friday. Or a Thursday. Ideally, deployments happen early in the week
when the team is available to respond to problems. The team has learned through experience that
deployments break things, so they treat each deployment as a high-risk event requiring maximum
staffing and attention.
Developers delay merging “risky” changes until after the next deploy so their code does not get
caught in the blast radius. Release managers add buffer time between deploys. The team informally
agrees on a deployment cadence (weekly, biweekly) that gives everyone time to recover between
releases.
The fear is rational. Deployments do break things. But the team’s response (deploy less often,
batch more changes, add more manual verification) makes each deployment larger, riskier, and more
likely to fail. The fear becomes self-reinforcing.
Common causes
Manual Deployments
When deployment requires human execution of steps, each deployment carries human error risk. The
team has experienced deployments where a step was missed, a script was run in the wrong order, or
a configuration was set incorrectly. The fear is not of the code but of the deployment process
itself. Automated deployments that execute the same steps identically every time eliminate the
process-level risk.
Read more: Manual Deployments
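As a rough illustration of "the same steps identically every time," a numbered runbook can be encoded as an ordered list of scripted steps that either all run or stop at the first failure. This is a minimal sketch with stub steps, not a drop-in deployment tool; real steps would call your own tooling.

```python
# Hypothetical runbook-as-code sketch: the step names mirror a typical manual
# runbook, but the bodies only print so the example stays self-contained.
import sys

def put_up_maintenance_page():
    print("step 1: maintenance page up")

def run_database_migration():
    print("step 2: migrations applied")

def roll_out_new_version():
    print("step 3: new version rolled out")

def remove_maintenance_page():
    print("step 4: maintenance page removed")

RUNBOOK = [
    put_up_maintenance_page,
    run_database_migration,
    roll_out_new_version,
    remove_maintenance_page,
]

if __name__ == "__main__":
    for step in RUNBOOK:
        try:
            step()
        except Exception as exc:      # fail fast; nothing depends on memory or mood
            sys.exit(f"deployment stopped at {step.__name__}: {exc}")
```

Even this small move removes the "was step four run before step three?" class of failure, because the order lives in code rather than in a document.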
Missing Deployment Pipeline
When there is no automated path from commit to production, the team has no confidence that the
deployed artifact has been properly built and tested. Did someone run the tests? Are we deploying
the right version? Is this the same artifact that was tested in staging? Without a pipeline that
enforces these checks, every deployment requires the team to manually verify the prerequisites.
Read more: Missing Deployment Pipeline
Blind Operations
When the team cannot observe production health after a deployment, they have no way to know
quickly whether the deploy succeeded or failed. The fear is not just that something will break but
that they will not know it broke until a customer reports it. Monitoring and automated health
checks transform deployment from “deploy and hope” to “deploy and verify.”
Read more: Blind Operations
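A hedged sketch of the simplest form of "deploy and verify": a script the pipeline runs right after deploying that polls a health endpoint and fails the build if the service never reports healthy. The URL, timeout, and interval below are assumptions, not values from this guide.

```python
# Post-deploy verification sketch using only the standard library.
import sys
import time
import urllib.error
import urllib.request

HEALTH_URL = "https://example.internal/healthz"   # hypothetical endpoint
TIMEOUT_S = 300        # give up after five minutes
INTERVAL_S = 10        # poll every ten seconds

def deploy_is_healthy() -> bool:
    deadline = time.monotonic() + TIMEOUT_S
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                if resp.status == 200:
                    return True        # verified, not hoped for
        except urllib.error.URLError:
            pass                       # not reachable yet; keep polling
        time.sleep(INTERVAL_S)
    return False

if __name__ == "__main__":
    sys.exit(0 if deploy_is_healthy() else 1)   # nonzero exit fails the pipeline
```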
Manual Testing Only
When the team has no automated tests, they have no confidence that the code works before
deploying it. Manual testing provides some coverage, but it is never exhaustive, and the team
knows it. Every deployment carries the risk that an untested code path will fail in production. A
comprehensive automated test suite gives the team evidence that the code works, replacing hope
with confidence.
Read more: Manual Testing Only
Monolithic Work Items
When changes are large, each deployment carries more risk simply because more code is changing at
once. A deployment with 200 lines changed across 3 files is easy to reason about and easy to roll
back. A deployment with 5,000 lines changed across 40 files is unpredictable. Small, frequent
deployments reduce risk per deployment rather than accumulating it.
Read more: Monolithic Work Items
How to narrow it down
- Is the deployment process automated? If a human runs the deployment, the fear may be of the
process, not the code. Start with
Manual Deployments.
- Does the team have an automated pipeline from commit to production? If not, there is no
systematic guarantee that the right artifact with the right tests reaches production. Start with
Missing Deployment Pipeline.
- Can the team verify production health within minutes of deploying? If not, the fear
includes not knowing whether the deploy worked. Start with
Blind Operations.
- Does the team have automated tests that provide confidence before deploying? If not, the
fear is that untested code will break. Start with
Manual Testing Only.
- How many changes are in a typical deployment? If deployments are large batches, the risk
per deployment is high by construction. Start with
Monolithic Work Items.
3 - Hardening Sprints Are Needed Before Every Release
The team dedicates one or more sprints after “feature complete” to stabilize code before it can be released.
What you are seeing
After the team finishes building features, nothing is ready to ship. A “hardening sprint” is
scheduled: one or more sprints dedicated to bug fixing, stabilization, and integration testing. No
new features are built during this period. The team knows from experience that the code is not
production-ready when development ends.
The hardening sprint finds bugs that were invisible during development. Integration issues surface
because components were built in isolation. Performance problems appear under realistic load. Edge
cases that nobody tested during development cause failures. The hardening sprint is not optional
because skipping it means shipping broken software.
The team treats this as normal. Planning includes hardening time by default. A project that takes
four sprints to build is planned as six: four for features, two for stabilization.
Common causes
Manual Testing Only
When the team has no automated test suite, quality verification happens manually at the end. The
hardening sprint is where manual testers find the defects that automated tests would have caught
during development. Without automated regression testing, every release requires a full manual
pass to verify nothing is broken.
Read more: Manual Testing Only
Inverted Test Pyramid
When most tests are slow end-to-end tests and few are unit tests, defects in business logic go
undetected until integration testing. The E2E tests are too slow to run continuously, so they run
at the end. The hardening sprint is when the team finally discovers what was broken all along.
Read more: Inverted Test Pyramid
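For contrast with a slow, top-heavy suite, here is a minimal sketch of the kind of sub-second, logic-level test that can run on every commit. The pricing rule and function names are invented for illustration and assume a pytest-style runner.

```python
# Fast unit test sketch: no network, no browser, no database.
def discounted_total(total_cents: int, loyalty_years: int) -> int:
    """5% off per loyalty year, capped at 25% (illustrative rule)."""
    discount = min(loyalty_years * 5, 25)
    return total_cents * (100 - discount) // 100

def test_discount_is_capped_at_25_percent():
    assert discounted_total(10_000, 10) == 7_500

def test_no_loyalty_means_no_discount():
    assert discounted_total(10_000, 0) == 10_000
```

Hundreds of tests like these run in seconds, so the defects they catch surface at commit time instead of in the hardening sprint.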
Undone Work
When the team’s definition of done does not include deployment and verification, stories are
marked complete while hidden work remains. Testing, validation, and integration happen after the
story is “done.” The hardening sprint is where all that undone work gets finished.
Read more: Undone Work
Monolithic Work Items
When features are built as large, indivisible units, integration risk accumulates silently. Each
large feature is developed in relative isolation for weeks. The hardening sprint is the first time
all the pieces come together, and the integration pain is proportional to the batch size.
Read more: Monolithic Work Items
Pressure to Skip Testing
When management pressures the team to maximize feature output, testing is deferred to “later.”
The hardening sprint is that “later.” Testing was not skipped; it was moved to the end where it is
less effective, more expensive, and blocks the release.
Read more: Pressure to Skip Testing
How to narrow it down
- Does the team have automated tests that run on every commit? If not, the hardening sprint
is compensating for the lack of continuous quality verification. Start with
Manual Testing Only.
- Are most automated tests end-to-end or UI tests? If the test suite is slow and top-heavy,
defects are caught late because fast unit tests are missing. Start with
Inverted Test Pyramid.
- Does the team’s definition of done include deployment and verification? If stories are
“done” before they are tested and deployed, the hardening sprint finishes what “done” should
have included. Start with
Undone Work.
- How large are the typical work items? If features take weeks and integrate at the end, the
batch size creates the integration risk. Start with
Monolithic Work Items.
- Is there pressure to prioritize features over testing? If testing is consistently deferred
to hit deadlines, the hardening sprint absorbs the cost. Start with
Pressure to Skip Testing.
4 - Releases Are Infrequent and Painful
Deployments happen monthly, quarterly, or less often. Each release is a large, risky event that requires war rooms and weekend work.
What you are seeing
The team deploys once a month, once a quarter, or on some irregular cadence that nobody can
predict. Each release is a significant event. There is a release planning meeting, a deployment
runbook, a designated release manager, and often a war room during the actual deploy. People
cancel plans for release weekends.
Between releases, changes pile up. By the time the release goes out, it contains dozens or
hundreds of changes from multiple developers. Nobody can confidently say what is in the release
without checking a spreadsheet or release notes document. When something breaks in production, the
team spends hours narrowing down which of the many changes caused the problem.
The team wants to release more often but feels trapped. Each release is so painful that adding
more releases feels like adding more pain.
Common causes
Manual Deployments
When deployment requires a human to execute steps (SSH into servers, run scripts, click through a
console), the process is slow, error-prone, and dependent on specific people being available. The
cost of each deployment is high enough that the team batches changes to amortize it. The batch
grows, the risk grows, and the release becomes an event rather than a routine.
Read more: Manual Deployments
Missing Deployment Pipeline
When there is no automated path from commit to production, every release requires manual
coordination of builds, tests, and deployments. Without a pipeline, the team cannot deploy on
demand because the process itself does not exist in a repeatable form.
Read more: Missing Deployment Pipeline
CAB Gates
When every production change requires committee approval, the approval cadence sets the release
cadence. If the Change Advisory Board meets weekly, releases happen weekly at best. If the meeting
is biweekly, releases are biweekly. The team cannot deploy faster than the approval process
allows, regardless of technical capability.
Read more: CAB Gates
Monolithic Work Items
When work is not decomposed into small, independently deployable increments, each “feature” is a
large batch of changes that takes weeks to complete. The team cannot release until the feature is
done, and the feature is never done quickly because it was scoped too large. Small batches enable
frequent releases. Large batches force infrequent ones.
Read more: Monolithic Work Items
Manual Regression Testing Gates
When every release requires a manual test pass that takes days or weeks, the testing cadence
limits the release cadence. The team cannot release until QA finishes, and QA cannot finish faster
because the test suite is manual and grows with every feature.
Read more: Manual Regression Testing Gates
How to narrow it down
- Is the deployment process automated? If deploying requires human steps beyond pressing a
button, the process itself is the bottleneck. Start with
Manual Deployments.
- Does a pipeline exist that can take code from commit to production? If not, the team cannot
release on demand because the infrastructure does not exist. Start with
Missing Deployment Pipeline.
- Does a committee or approval board gate production changes? If releases wait for scheduled
approval meetings, the approval cadence is the constraint. Start with
CAB Gates.
- How large is the typical work item? If features take weeks and are delivered as single
units, the batch size is the constraint. Start with
Monolithic Work Items.
- Does a manual test pass gate every release? If QA takes days per release, the testing
process is the constraint. Start with
Manual Regression Testing Gates.
5 - Merge Freezes Before Deployments
Developers announce merge freezes because the integration process is fragile. Deploying requires coordination in chat.
What you are seeing
A message appears in the team chat: “Please don’t merge to main, I’m about to deploy.” The
deployment process requires the main branch to be stable and unchanged for the duration of the
deploy. Any merge during that window could invalidate the tested artifact, break the build, or
create an inconsistent state between what was tested and what ships.
Other developers queue up their PRs and wait. If the deployment hits a problem, the freeze
extends. Sometimes the freeze lasts hours. In the worst cases, the team informally agrees on
“deployment windows” where merging is allowed at certain times and deployments happen at others.
The merge freeze is a coordination tax. Every deployment interrupts the entire team’s workflow.
Developers learn to time their merges around deploy schedules, adding mental overhead to routine
work.
Common causes
Manual Deployments
When deployment is a manual process (running scripts, clicking through UIs, executing a runbook),
the person deploying needs the environment to hold still. Any change to main during the deployment
window could mean the deployed artifact does not match what was tested. Automated deployments that
build, test, and deploy atomically eliminate this window because the pipeline handles the full
sequence without requiring a stable pause.
Read more: Manual Deployments
Integration Deferred
When the team does not have a reliable CI process, merging to main is itself risky. If the build
breaks after a merge, the deployment is blocked. The team freezes merges not just to protect the
deployment but because they lack confidence that any given merge will keep main green. If CI were
reliable, merging and deploying could happen concurrently because main would always be deployable.
Read more: Integration Deferred
Missing Deployment Pipeline
When there is no pipeline that takes a specific commit through build, test, and deploy as a single
atomic operation, the team must manually coordinate which commit gets deployed. A pipeline pins
the deployment to a specific artifact built from a specific commit. Without it, the team must
freeze merges to prevent the target from moving while they deploy.
Read more: Missing Deployment Pipeline
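One hedged sketch of what "pinning" can mean in practice: the test stage records the artifact's digest, and the deploy stage refuses to ship anything whose digest differs from what was tested. The file paths and names below are hypothetical.

```python
# Deploy-stage guard: the artifact must be byte-for-byte what the tests ran against.
import hashlib
import pathlib
import sys

ARTIFACT = pathlib.Path("build/app-1.4.2.tar.gz")    # assumed artifact path
EXPECTED = pathlib.Path("build/app-1.4.2.sha256")    # digest written by the test stage

def sha256(path: pathlib.Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    actual = sha256(ARTIFACT)
    expected = EXPECTED.read_text().strip()
    if actual != expected:
        sys.exit(f"refusing to deploy: digest {actual} does not match tested {expected}")
    print("deploying the exact artifact that was tested; later merges cannot change it")
```

Because the deploy targets an immutable artifact rather than "whatever is on main right now," a merge landing mid-deploy cannot change what ships, and the freeze becomes unnecessary.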
How to narrow it down
- Is the deployment process automated end-to-end? If a human executes deployment steps, the
freeze protects against variance in the manual process. Start with
Manual Deployments.
- Does the team trust that main is always deployable? If merges to main sometimes break the
build, the freeze protects against unreliable integration. Start with
Integration Deferred.
- Does the pipeline deploy a specific artifact from a specific commit? If there is no
pipeline that pins the deployment to an immutable artifact, the team must manually ensure the
target does not move. Start with
Missing Deployment Pipeline.
6 - Staging Passes but Production Fails
Deployments pass every pre-production check but break when they reach production.
What you are seeing
Code passes tests, QA signs off, staging looks fine. Then the release
hits production and something breaks: a feature behaves differently, a dependent service times
out, or data that never appeared in staging triggers an unhandled edge case.
The team scrambles to roll back or hotfix. Confidence in the pipeline drops. People start adding
more manual verification steps, which slows delivery without actually preventing the next
surprise.
Common causes
Snowflake Environments
When each environment is configured by hand (or was set up once and has drifted since), staging
and production are never truly the same. Different library versions, different environment
variables, different network configurations. Code that works in one context silently fails in
another because the environments are only superficially similar.
Read more: Snowflake Environments
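As a rough illustration of checking parity rather than assuming it, the sketch below diffs a few environment "fingerprints." The keys and example values are invented; in practice the dictionaries would be generated by your provisioning or inventory tooling rather than hard-coded.

```python
# Drift-check sketch: any line printed is a difference the team did not know about.
def diff_environments(staging: dict, production: dict) -> list[str]:
    findings = []
    for key in sorted(staging.keys() | production.keys()):
        a = staging.get(key, "<missing>")
        b = production.get(key, "<missing>")
        if a != b:
            findings.append(f"{key}: staging={a!r} production={b!r}")
    return findings

staging = {"python": "3.11.9", "openssl": "3.0.13", "FEATURE_X": "on"}
production = {"python": "3.10.4", "openssl": "3.0.13", "FEATURE_X": "off"}

for finding in diff_environments(staging, production):
    print(finding)
```

The durable fix is to provision every environment from the same infrastructure code, at which point a check like this should find nothing.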
Blind Operations
Sometimes the problem is not that staging passes and production fails. It is that production
failures go undetected until a customer reports them. Without monitoring and alerting, the team
has no way to verify production health after a deploy. “It works in staging” becomes the only
signal, and production problems surface hours or days late.
Read more: Blind Operations
Tightly Coupled Monolith
Hidden dependencies between components mean that a change in one area affects behavior in
another. In staging, these interactions may behave differently because the data is smaller, the
load is lighter, or a dependent service is stubbed. In production, the full weight of real usage
exposes coupling the team did not know existed.
Read more: Tightly Coupled Monolith
Manual Deployments
When deployment involves human steps (running scripts by hand, clicking through a console,
copying files), the process is never identical twice. A step skipped in staging, an extra
configuration applied in production, a different order of operations. The deployment itself
becomes a source of variance between environments.
Read more: Manual Deployments
How to narrow it down
- Are your environments provisioned from the same infrastructure code? If not, or if you
are not sure, start with Snowflake Environments.
- How did you discover the production failure? If a customer or support team reported it
rather than an automated alert, start with
Blind Operations.
- Does the failure involve a different service or module than the one you changed? If yes,
the issue is likely hidden coupling. Start with
Tightly Coupled Monolith.
- Is the deployment process identical and automated across all environments? If not, start
with Manual Deployments.