Deployment and Release Problems

Symptoms related to deployment frequency, release risk, coordination overhead, and environment parity.

These symptoms indicate problems with your deployment and release process. When deploying is painful, teams deploy less often, which increases batch size and risk. Each page describes what you are seeing and links to the anti-patterns most likely causing it.

How to use this section

Start with the symptom that matches what your team experiences. Each symptom page explains what you are seeing, identifies the most likely root causes (anti-patterns), and provides diagnostic questions to narrow down which cause applies to your situation. Follow the anti-pattern link to find concrete fix steps.

Related anti-pattern categories: Pipeline Anti-Patterns, Architecture Anti-Patterns

Related guides: Pipeline Architecture, Rollback, Small Batches

1 - Multiple Services Must Be Deployed Together

Changes cannot go to production until multiple services are deployed in a specific order during a coordinated release window.

What you are seeing

A developer finishes a change to one service. It is tested, reviewed, and ready to deploy. But it cannot go out alone. The change depends on a schema migration in a shared database, a new endpoint in another service, and a UI update in a third. All three teams coordinate a release window. Someone writes a deployment runbook with numbered steps. If step four fails, steps one through three need to be rolled back manually.

The team cannot deploy on a Tuesday afternoon because the other teams are not ready. The change sits in a branch (or merged to main but feature-flagged off) waiting for the coordinated release next Thursday. By then, more changes have accumulated, making the release larger and riskier.

Common causes

Tightly Coupled Monolith

When services share a database, call each other without versioned contracts, or depend on deployment order, they cannot be deployed independently. A change to Service A’s data model breaks Service B if Service B is not updated at the same time. The architecture forces coordination because the boundaries between services are not real boundaries. They are implementation details that leak across service lines.
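One common way out is to version the contract so the consumer tolerates both the old and the new shape during the transition. A minimal sketch, assuming hypothetical field names (a legacy `cust_id` being renamed to `customer_id`); the real services and payloads will differ:

```python
# Expand/contract sketch: Service B reads either the old or the new field,
# so Service A can rename the payload field (or column) in a separate deploy.

def read_customer_id(payload: dict) -> str:
    """Tolerant reader: accept the new field name, fall back to the old one."""
    if "customer_id" in payload:       # new contract (expand phase)
        return payload["customer_id"]
    if "cust_id" in payload:           # legacy contract, removed only after every
        return payload["cust_id"]      # producer has been migrated (contract phase)
    raise KeyError("payload has neither customer_id nor cust_id")


# Deploy order no longer matters:
# 1. Deploy Service B with the tolerant reader.
# 2. Deploy Service A writing the new field (and, temporarily, the old one).
# 3. Remove the fallback once nothing sends cust_id any more.
print(read_customer_id({"cust_id": "c-42"}))       # works before A is migrated
print(read_customer_id({"customer_id": "c-42"}))   # works after
```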

Read more: Tightly Coupled Monolith

Distributed Monolith

The organization moved from a monolith to services, but the service boundaries are wrong. Services were decomposed along technical lines (a “database service,” an “auth service,” a “notification service”) rather than along domain lines. The result is services that cannot handle a business request on their own. Every user-facing operation requires a synchronous chain of calls across multiple services. If one service in the chain is unavailable or deploying, the entire operation fails.

This is a monolith distributed across the network. It has all the operational complexity of microservices (network latency, partial failures, distributed debugging) with none of the benefits (independent deployment, team autonomy, fault isolation). Deploying one service still requires deploying the others because the boundaries do not correspond to independent units of business functionality.

Read more: Distributed Monolith

Horizontal Slicing

When work for a feature is decomposed by service (“Team A builds the API, Team B updates the UI, Team C modifies the processor”), each team’s change is incomplete on its own. Nothing is deployable until all teams finish their part. The decomposition created the coordination requirement. Vertical slicing within each team’s domain, with stable contracts between services, allows each team to deploy when their slice is ready.
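A stable contract can be checked mechanically instead of during a joint release. A minimal consumer-side sketch, assuming a hypothetical order payload and field names; in practice the payload would come from the provider's latest build or a shared fixture:

```python
# Consumer contract check: the UI team pins only the fields it actually depends on.
# The API team can deploy any change that keeps these assertions true.

EXPECTED_FIELDS = {"id": str, "status": str, "total_cents": int}  # hypothetical contract

def check_order_contract(payload: dict) -> None:
    for field, expected_type in EXPECTED_FIELDS.items():
        assert field in payload, f"missing field: {field}"
        assert isinstance(payload[field], expected_type), f"wrong type for {field}"

# In CI this would run against the provider's test instance; here, a stand-in payload.
check_order_contract({"id": "o-1", "status": "shipped", "total_cents": 1999, "extra": True})
print("contract satisfied; unknown fields are ignored, so the provider can add them freely")
```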

Read more: Horizontal Slicing

Undone Work

Sometimes the coordination requirement is artificial. The service could technically be deployed independently, but the team’s definition of done requires a cross-service integration test that only runs during the release window. Or deployment is gated on a manual approval from another team. The coordination is not forced by the architecture but by process decisions that bundle independent changes into a single release event.

Read more: Undone Work

How to narrow it down

  1. Do services share a database or call each other without versioned contracts? If yes, the architecture forces coordination. Changes to shared state or unversioned interfaces cannot be deployed independently. Start with Tightly Coupled Monolith.
  2. Does every user-facing request require a synchronous chain across multiple services? If a single business operation touches three or more services in sequence, the service boundaries were drawn in the wrong place. You have a distributed monolith. Start with Distributed Monolith.
  3. Was the feature decomposed by service or team rather than by behavior? If each team built their piece of the feature independently and now all pieces must go out together, the work was sliced horizontally. Start with Horizontal Slicing.
  4. Could each service technically be deployed on its own, but process or policy prevents it? If the coupling is in the release process (shared release window, cross-team sign-off, manual integration test gate) rather than in the code, the constraint is organizational. Start with Undone Work and examine whether the definition of done requires unnecessary coordination.

2 - The Team Is Afraid to Deploy

Production deployments cause anxiety because they frequently fail. The team delays deployments, which increases batch size, which increases risk.

What you are seeing

Nobody wants to deploy on a Friday. Or a Thursday. Ideally, deployments happen early in the week when the team is available to respond to problems. The team has learned through experience that deployments break things, so they treat each deployment as a high-risk event requiring maximum staffing and attention.

Developers delay merging “risky” changes until after the next deploy so their code does not get caught in the blast radius. Release managers add buffer time between deploys. The team informally agrees on a deployment cadence (weekly, biweekly) that gives everyone time to recover between releases.

The fear is rational. Deployments do break things. But the team’s response (deploy less often, batch more changes, add more manual verification) makes each deployment larger, riskier, and more likely to fail. The fear becomes self-reinforcing.

Common causes

Manual Deployments

When deployment requires human execution of steps, each deployment carries human error risk. The team has experienced deployments where a step was missed, a script was run in the wrong order, or a configuration was set incorrectly. The fear is not of the code but of the deployment process itself. Automated deployments that execute the same steps identically every time eliminate the process-level risk.
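The remedy is to move the runbook into a script the pipeline executes, so every deployment runs the same steps in the same order and stops at the first failure. A minimal sketch, assuming hypothetical `make` targets and a `deploy.sh` script:

```python
#!/usr/bin/env python3
"""Minimal sketch of a scripted deployment: same steps, same order, every time."""
import subprocess
import sys

# Hypothetical steps; replace with your own build/test/deploy commands.
STEPS = [
    ["make", "test"],
    ["make", "build"],
    ["./deploy.sh", "production"],
]

def run(step: list[str]) -> None:
    print(f"==> {' '.join(step)}")
    result = subprocess.run(step)
    if result.returncode != 0:
        # Fail fast: a human cannot forget a step or run them out of order.
        sys.exit(f"step failed: {' '.join(step)}")

if __name__ == "__main__":
    for step in STEPS:
        run(step)
    print("deployment complete")
```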

Read more: Manual Deployments

Missing Deployment Pipeline

When there is no automated path from commit to production, the team has no confidence that the deployed artifact has been properly built and tested. Did someone run the tests? Are we deploying the right version? Is this the same artifact that was tested in staging? Without a pipeline that enforces these checks, every deployment requires the team to manually verify the prerequisites.

Read more: Missing Deployment Pipeline

Blind Operations

When the team cannot observe production health after a deployment, they have no way to know quickly whether the deploy succeeded or failed. The fear is not just that something will break but that they will not know it broke until a customer reports it. Monitoring and automated health checks transform deployment from “deploy and hope” to “deploy and verify.”
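The verification step can start small: poll a health endpoint for a few minutes after the deploy and fail the pipeline if it does not stay healthy. A minimal sketch, assuming a hypothetical `/healthz` endpoint; a real check would also watch error rates and latency:

```python
#!/usr/bin/env python3
"""Post-deploy verification sketch: poll a health endpoint, fail the run if it degrades."""
import sys
import time
import urllib.request

HEALTH_URL = "https://example.com/healthz"   # hypothetical endpoint
CHECKS = 10                                  # 10 checks x 30s = ~5 minutes of verification
INTERVAL_SECONDS = 30

def healthy(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except OSError:          # connection errors, timeouts, and HTTP errors
        return False

failures = 0
for _ in range(CHECKS):
    if not healthy(HEALTH_URL):
        failures += 1
    time.sleep(INTERVAL_SECONDS)

if failures > 2:  # tolerate a blip, fail on a pattern
    sys.exit("post-deploy verification failed: trigger rollback")
print("deployment verified healthy")
```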

Read more: Blind Operations

Manual Testing Only

When the team has no automated tests, they have no confidence that the code works before deploying it. Manual testing provides some coverage, but it is never exhaustive, and the team knows it. Every deployment carries the risk that an untested code path will fail in production. A comprehensive automated test suite gives the team evidence that the code works, replacing hope with confidence.

Read more: Manual Testing Only

Monolithic Work Items

When changes are large, each deployment carries more risk simply because more code is changing at once. A deployment with 200 lines changed across 3 files is easy to reason about and easy to roll back. A deployment with 5,000 lines changed across 40 files is unpredictable. Small, frequent deployments reduce risk per deployment rather than accumulating it.

Read more: Monolithic Work Items

How to narrow it down

  1. Is the deployment process automated? If a human runs the deployment, the fear may be of the process, not the code. Start with Manual Deployments.
  2. Does the team have an automated pipeline from commit to production? If not, there is no systematic guarantee that the right artifact with the right tests reaches production. Start with Missing Deployment Pipeline.
  3. Can the team verify production health within minutes of deploying? If not, the fear includes not knowing whether the deploy worked. Start with Blind Operations.
  4. Does the team have automated tests that provide confidence before deploying? If not, the fear is that untested code will break. Start with Manual Testing Only.
  5. How many changes are in a typical deployment? If deployments are large batches, the risk per deployment is high by construction. Start with Monolithic Work Items.

3 - Hardening Sprints Are Needed Before Every Release

The team dedicates one or more sprints after “feature complete” to stabilize the code before it can be released.

What you are seeing

After the team finishes building features, nothing is ready to ship. A “hardening sprint” is scheduled: one or more sprints dedicated to bug fixing, stabilization, and integration testing. No new features are built during this period. The team knows from experience that the code is not production-ready when development ends.

The hardening sprint finds bugs that were invisible during development. Integration issues surface because components were built in isolation. Performance problems appear under realistic load. Edge cases that nobody tested during development cause failures. The hardening sprint is not optional because skipping it means shipping broken software.

The team treats this as normal. Planning includes hardening time by default. A project that takes four sprints to build is planned as six: four for features, two for stabilization.

Common causes

Manual Testing Only

When the team has no automated test suite, quality verification happens manually at the end. The hardening sprint is where manual testers find the defects that automated tests would have caught during development. Without automated regression testing, every release requires a full manual pass to verify nothing is broken.

Read more: Manual Testing Only

Inverted Test Pyramid

When most tests are slow end-to-end tests and few are unit tests, defects in business logic go undetected until integration testing. The E2E tests are too slow to run continuously, so they run at the end. The hardening sprint is when the team finally discovers what was broken all along.

Read more: Inverted Test Pyramid

Undone Work

When the team’s definition of done does not include deployment and verification, stories are marked complete while hidden work remains. Testing, validation, and integration happen after the story is “done.” The hardening sprint is where all that undone work gets finished.

Read more: Undone Work

Monolithic Work Items

When features are built as large, indivisible units, integration risk accumulates silently. Each large feature is developed in relative isolation for weeks. The hardening sprint is the first time all the pieces come together, and the integration pain is proportional to the batch size.

Read more: Monolithic Work Items

Pressure to Skip Testing

When management pressures the team to maximize feature output, testing is deferred to “later.” The hardening sprint is that “later.” Testing was not skipped; it was moved to the end where it is less effective, more expensive, and blocks the release.

Read more: Pressure to Skip Testing

How to narrow it down

  1. Does the team have automated tests that run on every commit? If not, the hardening sprint is compensating for the lack of continuous quality verification. Start with Manual Testing Only.
  2. Are most automated tests end-to-end or UI tests? If the test suite is slow and top-heavy, defects are caught late because fast unit tests are missing. Start with Inverted Test Pyramid.
  3. Does the team’s definition of done include deployment and verification? If stories are “done” before they are tested and deployed, the hardening sprint finishes what “done” should have included. Start with Undone Work.
  4. How large are the typical work items? If features take weeks and integrate at the end, the batch size creates the integration risk. Start with Monolithic Work Items.
  5. Is there pressure to prioritize features over testing? If testing is consistently deferred to hit deadlines, the hardening sprint absorbs the cost. Start with Pressure to Skip Testing.

4 - Releases Are Infrequent and Painful

Deployments happen monthly, quarterly, or less often. Each release is a large, risky event that requires war rooms and weekend work.

What you are seeing

The team deploys once a month, once a quarter, or on some irregular cadence that nobody can predict. Each release is a significant event. There is a release planning meeting, a deployment runbook, a designated release manager, and often a war room during the actual deploy. People cancel plans for release weekends.

Between releases, changes pile up. By the time the release goes out, it contains dozens or hundreds of changes from multiple developers. Nobody can confidently say what is in the release without checking a spreadsheet or release notes document. When something breaks in production, the team spends hours narrowing down which of the many changes caused the problem.

The team wants to release more often but feels trapped. Each release is so painful that adding more releases feels like adding more pain.

Common causes

Manual Deployments

When deployment requires a human to execute steps (SSH into servers, run scripts, click through a console), the process is slow, error-prone, and dependent on specific people being available. The cost of each deployment is high enough that the team batches changes to amortize it. The batch grows, the risk grows, and the release becomes an event rather than a routine.

Read more: Manual Deployments

Missing Deployment Pipeline

When there is no automated path from commit to production, every release requires manual coordination of builds, tests, and deployments. Without a pipeline, the team cannot deploy on demand because the process itself does not exist in a repeatable form.

Read more: Missing Deployment Pipeline

CAB Gates

When every production change requires committee approval, the approval cadence sets the release cadence. If the Change Advisory Board meets weekly, releases happen weekly at best. If the meeting is biweekly, releases are biweekly. The team cannot deploy faster than the approval process allows, regardless of technical capability.

Read more: CAB Gates

Monolithic Work Items

When work is not decomposed into small, independently deployable increments, each “feature” is a large batch of changes that takes weeks to complete. The team cannot release until the feature is done, and the feature is never done quickly because it was scoped too large. Small batches enable frequent releases. Large batches force infrequent ones.

Read more: Monolithic Work Items

Manual Regression Testing Gates

When every release requires a manual test pass that takes days or weeks, the testing cadence limits the release cadence. The team cannot release until QA finishes, and QA cannot finish faster because the test suite is manual and grows with every feature.

Read more: Manual Regression Testing Gates

How to narrow it down

  1. Is the deployment process automated? If deploying requires human steps beyond pressing a button, the process itself is the bottleneck. Start with Manual Deployments.
  2. Does a pipeline exist that can take code from commit to production? If not, the team cannot release on demand because the infrastructure does not exist. Start with Missing Deployment Pipeline.
  3. Does a committee or approval board gate production changes? If releases wait for scheduled approval meetings, the approval cadence is the constraint. Start with CAB Gates.
  4. How large is the typical work item? If features take weeks and are delivered as single units, the batch size is the constraint. Start with Monolithic Work Items.
  5. Does a manual test pass gate every release? If QA takes days per release, the testing process is the constraint. Start with Manual Regression Testing Gates.

5 - Merge Freezes Before Deployments

Developers announce merge freezes because the integration process is fragile. Deploying requires coordination in chat.

What you are seeing

A message appears in the team chat: “Please don’t merge to main, I’m about to deploy.” The deployment process requires the main branch to be stable and unchanged for the duration of the deploy. Any merge during that window could invalidate the tested artifact, break the build, or create an inconsistent state between what was tested and what ships.

Other developers queue up their PRs and wait. If the deployment hits a problem, the freeze extends. Sometimes the freeze lasts hours. In the worst cases, the team informally agrees on “deployment windows” where merging is allowed at certain times and deployments happen at others.

The merge freeze is a coordination tax. Every deployment interrupts the entire team’s workflow. Developers learn to time their merges around deploy schedules, adding mental overhead to routine work.

Common causes

Manual Deployments

When deployment is a manual process (running scripts, clicking through UIs, executing a runbook), the person deploying needs the environment to hold still. Any change to main during the deployment window could mean the deployed artifact does not match what was tested. Automated deployments that build, test, and deploy atomically eliminate this window because the pipeline handles the full sequence without requiring a stable pause.

Read more: Manual Deployments

Integration Deferred

When the team does not have a reliable CI process, merging to main is itself risky. If the build breaks after a merge, the deployment is blocked. The team freezes merges not just to protect the deployment but because they lack confidence that any given merge will keep main green. If CI were reliable, merging and deploying could happen concurrently because main would always be deployable.

Read more: Integration Deferred

Missing Deployment Pipeline

When there is no pipeline that takes a specific commit through build, test, and deploy as a single atomic operation, the team must manually coordinate which commit gets deployed. A pipeline pins the deployment to a specific artifact built from a specific commit. Without it, the team must freeze merges to prevent the target from moving while they deploy.
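The essential behavior of such a pipeline is that it resolves the commit once, builds an immutable artifact from it, and deploys exactly that artifact no matter what lands on main afterwards. A minimal sketch using a container image tag as the artifact identity; the registry, image name, and deploy step are hypothetical:

```python
#!/usr/bin/env python3
"""Sketch: pin the whole pipeline run to one commit so merges cannot move the target."""
import subprocess

def sh(*cmd: str) -> str:
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout.strip()

# 1. Resolve the commit once, at the start of the run.
commit = sh("git", "rev-parse", "HEAD")
image = f"registry.example.com/myapp:{commit}"   # hypothetical registry and image name

# 2. Build and publish that exact artifact.
sh("docker", "build", "-t", image, ".")
sh("docker", "push", image)

# 3. Deploy by immutable tag. Anything merged to main after step 1 is simply
#    not in this artifact, so there is nothing to freeze.
print(f"deploying {image}")
# sh("./deploy.sh", image)   # hypothetical deploy step
```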

Read more: Missing Deployment Pipeline

How to narrow it down

  1. Is the deployment process automated end-to-end? If a human executes deployment steps, the freeze protects against variance in the manual process. Start with Manual Deployments.
  2. Does the team trust that main is always deployable? If merges to main sometimes break the build, the freeze protects against unreliable integration. Start with Integration Deferred.
  3. Does the pipeline deploy a specific artifact from a specific commit? If there is no pipeline that pins the deployment to an immutable artifact, the team must manually ensure the target does not move. Start with Missing Deployment Pipeline.

6 - Staging Passes but Production Fails

Deployments pass every pre-production check but break when they reach production.

What you are seeing

Code passes tests, QA signs off, staging looks fine. Then the release hits production and something breaks: a feature behaves differently, a dependent service times out, or data that never appeared in staging triggers an unhandled edge case.

The team scrambles to roll back or hotfix. Confidence in the pipeline drops. People start adding more manual verification steps, which slows delivery without actually preventing the next surprise.

Common causes

Snowflake Environments

When each environment is configured by hand (or was set up once and has drifted since), staging and production are never truly the same. Different library versions, different environment variables, different network configurations. Code that works in one context silently fails in another because the environments are only superficially similar.
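Even before environments are rebuilt from shared infrastructure code, drift can be made visible. A minimal sketch that diffs configuration snapshots captured from two environments (the file names and keys are hypothetical); anything that differs outside an explicit allow-list is a candidate explanation for a staging/production mismatch:

```python
#!/usr/bin/env python3
"""Sketch: diff configuration snapshots from two environments to surface drift."""
import json

ALLOWED_DIFFERENCES = {"ENVIRONMENT_NAME", "DATABASE_HOST"}  # expected to differ

def load(path: str) -> dict:
    with open(path) as f:
        return json.load(f)   # e.g. {"LIBFOO_VERSION": "1.4.2", "FEATURE_X": "on", ...}

staging = load("staging-config.json")        # hypothetical snapshot files
production = load("production-config.json")

drift = []
for key in sorted(set(staging) | set(production)):
    if key in ALLOWED_DIFFERENCES:
        continue
    if staging.get(key) != production.get(key):
        drift.append(f"{key}: staging={staging.get(key)!r} production={production.get(key)!r}")

if drift:
    print("environment drift detected:")
    print("\n".join(drift))
else:
    print("no unexpected differences")
```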

Read more: Snowflake Environments

Blind Operations

Sometimes the problem is not that staging passes and production fails. It is that production failures go undetected until a customer reports them. Without monitoring and alerting, the team has no way to verify production health after a deploy. “It works in staging” becomes the only signal, and production problems surface hours or days late.

Read more: Blind Operations

Tightly Coupled Monolith

Hidden dependencies between components mean that a change in one area affects behavior in another. In staging, these interactions may behave differently because the data is smaller, the load is lighter, or a dependent service is stubbed. In production, the full weight of real usage exposes coupling the team did not know existed.

Read more: Tightly Coupled Monolith

Manual Deployments

When deployment involves human steps (running scripts by hand, clicking through a console, copying files), the process is never identical twice. A step skipped in staging, an extra configuration applied in production, a different order of operations. The deployment itself becomes a source of variance between environments.

Read more: Manual Deployments

How to narrow it down

  1. Are your environments provisioned from the same infrastructure code? If not, or if you are not sure, start with Snowflake Environments.
  2. How did you discover the production failure? If a customer or support team reported it rather than an automated alert, start with Blind Operations.
  3. Does the failure involve a different service or module than the one you changed? If yes, the issue is likely hidden coupling. Start with Tightly Coupled Monolith.
  4. Is the deployment process identical and automated across all environments? If not, start with Manual Deployments.