Dysfunction Symptoms
Start from what you observe. Find the anti-patterns causing it.
Not sure which anti-pattern is hurting your team? Start here. Choose the path that fits how you
want to explore.
Find your symptom
Answer a few questions to narrow down which symptoms match your situation.
Start the triage questions
Browse by category
Jump directly to the area where you are experiencing problems.
- Test Suite Problems - Flaky tests, slow suites, high coverage that misses defects, environment-dependent failures
- Deployment and Release Problems - Fear of deploying, infrequent releases, coordinated deployments, merge freezes, hardening sprints
- Integration and Feedback Problems - Too much WIP, long cycle times, review bottlenecks, painful merges, slow feedback loops
- Production Visibility and Team Health - Customers finding bugs first, slow incident detection, environment drift, team burnout
Start from your role
Each role sees different symptoms first. Find the ones most relevant to your daily work.
- For Developers - Symptoms you hit while writing, testing, and shipping code - from flaky tests to painful merges
- For Managers - Symptoms that show up as unpredictable delivery, quality gaps, and team health problems
Explore by theme
Symptoms and anti-patterns share common themes. Browse by tag to see connections across categories.
View all tags
1 - Test Suite Problems
Symptoms related to test reliability, coverage effectiveness, speed, and environment consistency.
These symptoms indicate problems with your testing strategy. Unreliable or slow tests erode
confidence and slow delivery. Each page describes what you are seeing and links to the
anti-patterns most likely causing it.
How to use this section
Start with the symptom that matches what your team experiences. Each symptom page explains what
you are seeing, identifies the most likely root causes (anti-patterns), and provides diagnostic
questions to narrow down which cause applies to your situation. Follow the anti-pattern link to
find concrete fix steps.
Related anti-pattern categories: Testing Anti-Patterns,
Pipeline Anti-Patterns
Related guide: Testing Fundamentals
1.1 - Tests Pass in One Environment but Fail in Another
Tests pass locally but fail in CI, or pass in CI but fail in staging. Environment differences cause unpredictable failures.
What you are seeing
A developer runs the tests locally and they pass. They push to CI and the same tests fail. Or the
CI pipeline is green but the tests fail in the staging environment. The failures are not caused by
a code defect. They are caused by differences between environments: a different OS version, a
different database version, a different timezone setting, a missing environment variable, or a
service that is available locally but not in CI.
The developer spends time debugging the failure and discovers the root cause is environmental, not
logical. They add a workaround (skip the test in CI, add an environment check, adjust a timeout)
and move on. These workarounds accumulate over time. The test suite becomes littered with
environment-specific conditionals and skipped tests.
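A minimal pytest sketch of what that accumulation tends to look like. The render_pdf stand-in, the skip reason, and the assertions are invented for illustration:

```python
import os
import sys
import tempfile

import pytest


def render_pdf(name: str) -> str:
    """Stand-in for the real code under test: writes a file and returns its path."""
    path = os.path.join(tempfile.gettempdir(), f"{name}.pdf")
    with open(path, "wb") as handle:
        handle.write(b"%PDF-1.4")
    return path


# Workaround 1: skip in CI instead of removing the environment difference.
@pytest.mark.skipif(os.environ.get("CI") == "true",
                    reason="renderer dependencies are not installed on CI agents")
def test_render_pdf_produces_file():
    assert render_pdf("q3-summary").endswith(".pdf")


# Workaround 2: branch the assertion on the operating system.
def test_render_pdf_path_separator():
    path = render_pdf("q3-summary")
    expected_separator = "\\" if sys.platform == "win32" else "/"
    assert expected_separator in path
```

Each conditional papers over an environment difference instead of removing it, and each one makes the suite's verdict depend on where it runs.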
The team loses confidence in the test suite because results depend on where the tests run rather
than whether the code is correct.
Common causes
Snowflake Environments
When each environment is configured by hand and maintained independently, they drift apart over
time. The developer’s laptop has one version of a database driver. The CI server has another. The
staging environment has a third. These differences are invisible until a test exercises a code
path that behaves differently across versions. The fix is not to harmonize configurations manually
(they will drift again) but to provision all environments from the same infrastructure code.
Read more: Snowflake Environments
Manual Deployments
When deployment and environment setup are manual processes, subtle differences creep in. One
developer installed a dependency a particular way. The CI server was configured by a different
person with slightly different settings. The staging environment was set up months ago and has not
been updated. Manual processes are never identical twice, and the variance causes
environment-dependent behavior.
Read more: Manual Deployments
Tightly Coupled Monolith
When the application has hidden dependencies on external state (filesystem paths, network
services, system configuration), tests that work in one environment fail in another because the
external state differs. Well-isolated code with explicit dependencies is portable across
environments. Tightly coupled code that reaches into its environment for implicit dependencies is
fragile.
Read more: Tightly Coupled Monolith
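A hedged sketch of the difference, with invented names: the first function reaches into its environment for implicit state, while the second declares its dependency explicitly and behaves the same wherever it runs:

```python
import json
import os
from pathlib import Path


# Implicit dependency: the function reaches into its environment for a variable
# and a fixed file layout that exist on one machine but not necessarily on another.
def load_feature_flags_implicit() -> dict:
    config_dir = os.environ["APP_CONFIG_DIR"]              # KeyError where unset
    return json.loads(Path(config_dir, "flags.json").read_text())


# Explicit dependency: the caller supplies the location, so a test can hand in a
# temporary file and the function behaves identically in every environment.
def load_feature_flags(config_path: Path) -> dict:
    return json.loads(config_path.read_text())
```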
How to narrow it down
- Are all environments provisioned from the same infrastructure code? If not, environment
drift is the most likely cause. Start with
Snowflake Environments.
- Are environment setup and configuration manual? If different people configured different
environments, the variance is a direct result of manual processes. Start with
Manual Deployments.
- Do the failing tests depend on external services, filesystem paths, or system
configuration? If tests assume specific external state rather than declaring explicit
dependencies, the code’s coupling to its environment is the issue. Start with
Tightly Coupled Monolith.
1.2 - High Coverage but Tests Miss Defects
Test coverage numbers look healthy but defects still reach production.
What you are seeing
Your dashboard shows 80% or 90% code coverage, but bugs keep getting through. Defects show up
in production that feel like they should have been caught. The team points to the coverage
number as proof that testing is solid, yet the results tell a different story.
People start losing trust in the test suite. Some developers stop running tests locally because
they do not believe the tests will catch anything useful. Others add more tests, pushing
coverage higher, without the defect rate improving.
Common causes
Inverted Test Pyramid
When most of your tests are end-to-end or integration tests, they exercise many code paths in a
single run - which inflates coverage numbers. But these tests often verify that a workflow
completes without errors, not that each piece of logic produces the correct result. A test that
clicks through a form and checks for a success message covers dozens of functions without
validating any of them in detail.
Read more: Inverted Test Pyramid
Pressure to Skip Testing
When teams face pressure to hit a coverage target, testing becomes theater. Developers write
tests with trivial assertions - checking that a function returns without throwing, or that a
value is not null - just to get the number up. The coverage metric looks healthy, but the tests
do not actually verify behavior. They exist to satisfy a gate, not to catch defects.
Read more: Pressure to Skip Testing
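A small illustration of the difference, with an invented discount function: both tests execute the same lines and earn the same coverage, but only the second can fail when the logic is wrong:

```python
def apply_discount(price: float, percent: float) -> float:
    return round(price * (1 - percent / 100), 2)


# Testing theater: executes the code, raises coverage, verifies almost nothing.
def test_apply_discount_returns_something():
    assert apply_discount(100.0, 10.0) is not None


# Behavior test: pins the expected outcome, so a defect in the math actually fails.
def test_apply_discount_reduces_price_by_percent():
    assert apply_discount(100.0, 10.0) == 90.0
    assert apply_discount(19.99, 0.0) == 19.99
```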
Code Coverage Mandates
When the organization gates the pipeline on a coverage target, teams optimize for the number
rather than for defect detection. Developers write assertion-free tests, cover trivial code, or
add single integration tests that execute hundreds of lines without validating any of them. The
coverage metric rises while the tests remain unable to catch meaningful defects.
Read more: Code Coverage Mandates
Manual Testing Only
When test automation is absent or minimal, teams sometimes generate superficial tests or rely on
coverage from integration-level runs that touch many lines without asserting meaningful outcomes.
The coverage tool counts every line that executes, regardless of whether any test validates the
result.
Read more: Manual Testing Only
How to narrow it down
- Do most tests assert on behavior and expected outcomes, or do they just verify that code
runs without errors? If tests mostly check for no-exceptions or non-null returns, the
problem is testing theater - tests written to hit a number, not to catch defects. Start with
Pressure to Skip Testing.
- Are the majority of your tests end-to-end or integration tests? If most of the suite runs
through a browser, API, or multi-service flow rather than testing units of logic directly,
start with Inverted Test Pyramid.
- Does the pipeline gate on a specific coverage percentage? If the team writes tests
primarily to keep coverage above a mandated threshold, start with
Code Coverage Mandates.
- Were tests added retroactively to meet a coverage target? If the bulk of tests were
written after the code to satisfy a coverage gate rather than to verify design decisions,
start with
Pressure to Skip Testing.
1.3 - Refactoring Breaks Tests
Internal code changes that do not alter behavior cause widespread test failures.
What you are seeing
A developer renames a method, extracts a class, or reorganizes modules - changes that should not
affect external behavior. But dozens of tests fail. The failures are not catching real bugs.
They are breaking because the tests depend on implementation details that changed.
Developers start avoiding refactoring because the cost of updating tests is too high. Code
quality degrades over time because cleanup work is too expensive. When someone does refactor,
they spend more time fixing tests than improving the code.
Common causes
Inverted Test Pyramid
When the test suite is dominated by end-to-end and integration tests, those tests tend to be
tightly coupled to implementation details - CSS selectors, API response shapes, DOM structure,
or specific sequences of internal calls. A refactoring that changes none of the observable
behavior still breaks these tests because they assert on how the system works rather than what
it does.
Unit tests focused on behavior (“given this input, expect this output”) survive refactoring.
Tests coupled to implementation (“this method was called with these arguments”) do not.
Read more: Inverted Test Pyramid
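A hedged sketch of the contrast using Python's unittest.mock; the PriceService and its collaborator are invented for illustration:

```python
from unittest.mock import Mock


class PriceService:
    def __init__(self, tax_calculator):
        self._tax = tax_calculator

    def total(self, net: float) -> float:
        return net + self._tax.tax_for(net)


# Behavior-focused: given this input, expect this output. Reorganizing how
# PriceService computes the total does not break it as long as the total is right.
def test_total_includes_tax():
    tax = Mock()
    tax.tax_for.return_value = 2.0
    assert PriceService(tax).total(10.0) == 12.0


# Implementation-coupled: asserts how the result was produced. A refactoring that
# computes the same total differently fails this test without any real defect.
def test_total_calls_tax_calculator_exactly_once():
    tax = Mock()
    tax.tax_for.return_value = 2.0
    PriceService(tax).total(10.0)
    tax.tax_for.assert_called_once_with(10.0)
```

Both tests pass today; only the second breaks when the internals are reorganized without any change in observable behavior.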
Tightly Coupled Monolith
When components lack clear interfaces, tests reach into the internals of other modules. A
refactoring in module A breaks tests for module B - not because B’s behavior changed, but
because B’s tests were calling A’s internal methods directly. Without well-defined boundaries,
every internal change ripples across the test suite.
Read more: Tightly Coupled Monolith
How to narrow it down
- Do the broken tests assert on internal method calls, mock interactions, or DOM structure?
If yes, the tests are coupled to implementation rather than behavior. This is a test design
issue - start with Inverted Test Pyramid for guidance
on building a behavior-focused test suite.
- Are the broken tests end-to-end or UI tests that fail because of layout or selector
changes? If yes, you have too many tests at the wrong level of the pyramid. Start with
Inverted Test Pyramid.
- Do the broken tests span multiple modules - testing code in one area but breaking because
of changes in another? If yes, the problem is missing boundaries between components. Start
with Tightly Coupled Monolith.
1.4 - Test Suite Is Too Slow to Run
The test suite takes 30 minutes or more. Developers stop running it locally and push without verifying.
What you are seeing
The full test suite takes 30 minutes, an hour, or longer. Developers do not run it locally because
they cannot afford to wait. Instead, they push their changes and let CI run the tests. Feedback
arrives long after the developer has moved on. If a test fails, the developer must context-switch
back, recall what they were doing, and debug the failure.
Some developers run only a subset of tests locally (the ones for their module) and skip the rest.
This catches some issues but misses integration problems between modules. Others skip local testing
entirely and treat the CI pipeline as their test runner, which overloads the shared pipeline and
increases wait times for everyone.
The team has discussed parallelizing the tests, splitting the suite, or adding more CI capacity.
These discussions stall because the root cause is not infrastructure. It is the shape of the test
suite itself.
Common causes
Inverted Test Pyramid
When the majority of tests are end-to-end or integration tests, the suite is inherently slow. E2E
tests launch browsers, start services, make network calls, and wait for responses. Each test takes
seconds or minutes instead of milliseconds. A suite of 500 E2E tests will always be slower than a
suite of 5,000 unit tests that verify the same logic at a lower level. The fix is not faster
hardware. It is moving test coverage down the pyramid.
Read more: Inverted Test Pyramid
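A back-of-envelope comparison makes the gap concrete. The per-test times below are assumptions for illustration, not measurements:

```python
# Illustrative assumptions: ~6 s per E2E test (browser startup, service calls,
# network waits) versus ~5 ms per in-process unit test.
e2e_minutes = 500 * 6 / 60          # ≈ 50 minutes, before retries and queue time
unit_seconds = 5_000 * 0.005        # ≈ 25 seconds for the entire suite

print(f"500 E2E tests:    ~{e2e_minutes:.0f} min")
print(f"5,000 unit tests: ~{unit_seconds:.0f} s")
```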
Tightly Coupled Monolith
When the codebase has no clear module boundaries, tests cannot be scoped to individual components.
A test for one feature must set up the entire application because the feature depends on
everything. Test setup and teardown dominate execution time because there is no way to isolate the
system under test.
Read more: Tightly Coupled Monolith
Manual Testing Only
Sometimes the test suite is slow because the team added automated tests as an afterthought, using
E2E tests to backfill coverage for code that was not designed for unit testing. The resulting suite
is a collection of heavyweight tests that exercise the full stack for every scenario because the
code provides no lower-level testing seams.
Read more: Manual Testing Only
How to narrow it down
- What is the ratio of unit tests to E2E/integration tests? If E2E tests outnumber unit
tests, the test pyramid is inverted and the suite is slow by design. Start with
Inverted Test Pyramid.
- Can tests be run for a single module in isolation? If running one module’s tests requires
starting the entire application, the architecture prevents test isolation. Start with
Tightly Coupled Monolith.
- Were the automated tests added retroactively to a codebase with no testing seams? If tests
were bolted on after the fact using E2E tests because the code cannot be unit-tested, the
codebase needs refactoring for testability. Start with
Manual Testing Only.
1.5 - Tests Randomly Pass or Fail
The pipeline fails, the developer reruns it without changing anything, and it passes.
What you are seeing
A developer pushes a change. The pipeline fails on a test they did not touch, in a module they
did not change. They click rerun. It passes. They merge. This happens multiple times a day across
the team. Nobody investigates failures on the first occurrence because the odds favor flakiness
over a real problem.
The team has adapted: retry-until-green is a routine step, not an exception. Some pipelines are
configured to automatically rerun failed tests. Tests are tagged as “known flaky” and skipped.
Real regressions hide behind the noise because the team has been trained to ignore failures.
Common causes
Inverted Test Pyramid
When the test suite is dominated by end-to-end tests, flakiness is structural. E2E tests depend
on network connectivity, shared test environments, external service availability, and browser
rendering timing. Any of these can produce a different result on each run. A suite built mostly
on E2E tests will always be flaky because it is built on non-deterministic foundations.
Replacing E2E tests with functional tests that use test doubles for external dependencies makes
the suite deterministic by design. The test produces the same result every time because it
controls all its inputs.
Read more: Inverted Test Pyramid
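A minimal sketch of the substitution, with an invented weather dependency: the test double stands in for the network call, so the test controls every input and returns the same verdict on every run:

```python
from unittest.mock import Mock


class TripPlanner:
    def __init__(self, weather_client):
        self._weather = weather_client

    def advice(self, city: str) -> str:
        forecast = self._weather.forecast(city)     # in production: a network call
        return "pack an umbrella" if forecast == "rain" else "travel light"


# Deterministic functional test: the external service is replaced by a test double,
# so the test controls its inputs and produces the same result on every run.
def test_advice_for_a_rainy_forecast():
    weather = Mock()
    weather.forecast.return_value = "rain"
    assert TripPlanner(weather).advice("Bergen") == "pack an umbrella"
```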
Snowflake Environments
When the CI environment is configured differently from other environments - or drifts over time -
tests pass locally but fail in CI, or pass in CI on Tuesday but fail on Wednesday. The
inconsistency is not in the test or the code but in the environment the test runs in.
Tests that depend on specific environment configurations, installed packages, file system layout,
or network access are vulnerable to environment drift. Infrastructure-as-code eliminates this
class of flakiness by ensuring environments are identical and reproducible.
Read more: Snowflake Environments
Tightly Coupled Monolith
When components share mutable state - a database, a cache, a filesystem directory - tests that
run concurrently or in a specific order can interfere with each other. Test A writes to a shared
table. Test B reads from the same table and gets unexpected data. The tests pass individually
but fail together, or pass in one order but fail in another.
Without clear component boundaries, tests cannot be isolated. The flakiness is a symptom of
architectural coupling, not a testing problem.
Read more: Tightly Coupled Monolith
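Isolating test data is the quickest mitigation while the coupling is addressed (the diagnostic questions below make the same point). A hedged pytest sketch that gives each test its own in-memory database and unique rows:

```python
import sqlite3
import uuid

import pytest


@pytest.fixture
def db():
    # Each test gets its own in-memory database: no test can observe rows written
    # by another test, regardless of execution order or parallelism.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
    yield conn
    conn.close()


def test_new_order_starts_as_pending(db):
    order_id = str(uuid.uuid4())      # unique data instead of shared fixture rows
    db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, "pending"))
    row = db.execute("SELECT status FROM orders WHERE id = ?", (order_id,)).fetchone()
    assert row == ("pending",)
```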
How to narrow it down
- Do the flaky tests hit real external services or shared environments? If yes, the tests
are non-deterministic by design. Start with
Inverted Test Pyramid and replace them with
functional tests using test doubles.
- Do tests pass locally but fail in CI, or vice versa? If yes, the environments differ.
Start with Snowflake Environments.
- Do tests pass individually but fail when run together, or fail in a different order? If
yes, tests share mutable state. Start with
Tightly Coupled Monolith for the
architectural root cause, and isolate test data as an immediate fix.
2 - Deployment and Release Problems
Symptoms related to deployment frequency, release risk, coordination overhead, and environment parity.
These symptoms indicate problems with your deployment and release process. When deploying is
painful, teams deploy less often, which increases batch size and risk. Each page describes what
you are seeing and links to the anti-patterns most likely causing it.
How to use this section
Start with the symptom that matches what your team experiences. Each symptom page explains what
you are seeing, identifies the most likely root causes (anti-patterns), and provides diagnostic
questions to narrow down which cause applies to your situation. Follow the anti-pattern link to
find concrete fix steps.
Related anti-pattern categories: Pipeline Anti-Patterns,
Architecture Anti-Patterns
Related guides: Pipeline Architecture,
Rollback,
Small Batches
2.1 - Multiple Services Must Be Deployed Together
Changes cannot go to production until multiple services are deployed in a specific order during a coordinated release window.
What you are seeing
A developer finishes a change to one service. It is tested, reviewed, and ready to deploy. But it
cannot go out alone. The change depends on a schema migration in a shared database, a new endpoint
in another service, and a UI update in a third. All three teams coordinate a release window.
Someone writes a deployment runbook with numbered steps. If step four fails, steps one through
three need to be rolled back manually.
The team cannot deploy on a Tuesday afternoon because the other teams are not ready. The change
sits in a branch (or merged to main but feature-flagged off) waiting for the coordinated release
next Thursday. By then, more changes have accumulated, making the release larger and riskier.
Common causes
Tightly Coupled Monolith
When services share a database, call each other without versioned contracts, or depend on
deployment order, they cannot be deployed independently. A change to Service A’s data model breaks
Service B if Service B is not updated at the same time. The architecture forces coordination
because the boundaries between services are not real boundaries. They are implementation details
that leak across service lines.
Read more: Tightly Coupled Monolith
Distributed Monolith
The organization moved from a monolith to services, but the service boundaries are wrong. Services
were decomposed along technical lines (a “database service,” an “auth service,” a “notification
service”) rather than along domain lines. The result is services that cannot handle a business
request on their own. Every user-facing operation requires a synchronous chain of calls across
multiple services. If one service in the chain is unavailable or deploying, the entire operation
fails.
This is a monolith distributed across the network. It has all the operational complexity of
microservices (network latency, partial failures, distributed debugging) with none of the
benefits (independent deployment, team autonomy, fault isolation). Deploying one service still
requires deploying the others because the boundaries do not correspond to independent units of
business functionality.
Read more: Distributed Monolith
Horizontal Slicing
When work for a feature is decomposed by service (“Team A builds the API, Team B updates the UI,
Team C modifies the processor”), each team’s change is incomplete on its own. Nothing is
deployable until all teams finish their part. The decomposition created the coordination
requirement. Vertical slicing within each team’s domain, with stable contracts between services,
allows each team to deploy when their slice is ready.
Read more: Horizontal Slicing
Undone Work
Sometimes the coordination requirement is artificial. The service could technically be deployed
independently, but the team’s definition of done requires a cross-service integration test that
only runs during the release window. Or deployment is gated on a manual approval from another
team. The coordination is not forced by the architecture but by process decisions that bundle
independent changes into a single release event.
Read more: Undone Work
How to narrow it down
- Do services share a database or call each other without versioned contracts? If yes, the
architecture forces coordination. Changes to shared state or unversioned interfaces cannot be
deployed independently. Start with
Tightly Coupled Monolith.
- Does every user-facing request require a synchronous chain across multiple services? If a
single business operation touches three or more services in sequence, the service boundaries
were drawn in the wrong place. You have a distributed monolith. Start with
Distributed Monolith.
- Was the feature decomposed by service or team rather than by behavior? If each team built
their piece of the feature independently and now all pieces must go out together, the work was
sliced horizontally. Start with
Horizontal Slicing.
- Could each service technically be deployed on its own, but process or policy prevents it?
If the coupling is in the release process (shared release window, cross-team sign-off, manual
integration test gate) rather than in the code, the constraint is organizational. Start with
Undone Work and examine whether the definition
of done requires unnecessary coordination.
2.2 - The Team Is Afraid to Deploy
Production deployments cause anxiety because they frequently fail. The team delays deployments, which increases batch size, which increases risk.
What you are seeing
Nobody wants to deploy on a Friday. Or a Thursday. Ideally, deployments happen early in the week
when the team is available to respond to problems. The team has learned through experience that
deployments break things, so they treat each deployment as a high-risk event requiring maximum
staffing and attention.
Developers delay merging “risky” changes until after the next deploy so their code does not get
caught in the blast radius. Release managers add buffer time between deploys. The team informally
agrees on a deployment cadence (weekly, biweekly) that gives everyone time to recover between
releases.
The fear is rational. Deployments do break things. But the team’s response (deploy less often,
batch more changes, add more manual verification) makes each deployment larger, riskier, and more
likely to fail. The fear becomes self-reinforcing.
Common causes
Manual Deployments
When deployment requires human execution of steps, each deployment carries human error risk. The
team has experienced deployments where a step was missed, a script was run in the wrong order, or
a configuration was set incorrectly. The fear is not of the code but of the deployment process
itself. Automated deployments that execute the same steps identically every time eliminate the
process-level risk.
Read more: Manual Deployments
Missing Deployment Pipeline
When there is no automated path from commit to production, the team has no confidence that the
deployed artifact has been properly built and tested. Did someone run the tests? Are we deploying
the right version? Is this the same artifact that was tested in staging? Without a pipeline that
enforces these checks, every deployment requires the team to manually verify the prerequisites.
Read more: Missing Deployment Pipeline
Blind Operations
When the team cannot observe production health after a deployment, they have no way to know
quickly whether the deploy succeeded or failed. The fear is not just that something will break but
that they will not know it broke until a customer reports it. Monitoring and automated health
checks transform deployment from “deploy and hope” to “deploy and verify.”
Read more: Blind Operations
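A deploy-and-verify step does not require a full observability stack to start. A hedged sketch of a post-deployment smoke check, with a placeholder health endpoint, that a pipeline could run immediately after the deploy step:

```python
import sys
import time
import urllib.request

HEALTH_URL = "https://example.internal/healthz"     # placeholder endpoint

# Poll the service's health endpoint for up to two minutes after the deploy and
# fail the pipeline step if it never reports healthy.
deadline = time.time() + 120
while time.time() < deadline:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as response:
            if response.status == 200:
                print("service healthy; deployment verified")
                sys.exit(0)
    except OSError:
        pass                                         # not reachable yet; keep polling
    time.sleep(5)

print("service never reported healthy; treat the deployment as failed", file=sys.stderr)
sys.exit(1)
```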
Manual Testing Only
When the team has no automated tests, they have no confidence that the code works before
deploying it. Manual testing provides some coverage, but it is never exhaustive, and the team
knows it. Every deployment carries the risk that an untested code path will fail in production. A
comprehensive automated test suite gives the team evidence that the code works, replacing hope
with confidence.
Read more: Manual Testing Only
Monolithic Work Items
When changes are large, each deployment carries more risk simply because more code is changing at
once. A deployment with 200 lines changed across 3 files is easy to reason about and easy to roll
back. A deployment with 5,000 lines changed across 40 files is unpredictable. Small, frequent
deployments reduce risk per deployment rather than accumulating it.
Read more: Monolithic Work Items
How to narrow it down
- Is the deployment process automated? If a human runs the deployment, the fear may be of the
process, not the code. Start with
Manual Deployments.
- Does the team have an automated pipeline from commit to production? If not, there is no
systematic guarantee that the right artifact with the right tests reaches production. Start with
Missing Deployment Pipeline.
- Can the team verify production health within minutes of deploying? If not, the fear
includes not knowing whether the deploy worked. Start with
Blind Operations.
- Does the team have automated tests that provide confidence before deploying? If not, the
fear is that untested code will break. Start with
Manual Testing Only.
- How many changes are in a typical deployment? If deployments are large batches, the risk
per deployment is high by construction. Start with
Monolithic Work Items.
2.3 - Hardening Sprints Are Needed Before Every Release
The team dedicates one or more sprints after “feature complete” to stabilize code before it can be released.
What you are seeing
After the team finishes building features, nothing is ready to ship. A “hardening sprint” is
scheduled: one or more sprints dedicated to bug fixing, stabilization, and integration testing. No
new features are built during this period. The team knows from experience that the code is not
production-ready when development ends.
The hardening sprint finds bugs that were invisible during development. Integration issues surface
because components were built in isolation. Performance problems appear under realistic load. Edge
cases that nobody tested during development cause failures. The hardening sprint is not optional
because skipping it means shipping broken software.
The team treats this as normal. Planning includes hardening time by default. A project that takes
four sprints to build is planned as six: four for features, two for stabilization.
Common causes
Manual Testing Only
When the team has no automated test suite, quality verification happens manually at the end. The
hardening sprint is where manual testers find the defects that automated tests would have caught
during development. Without automated regression testing, every release requires a full manual
pass to verify nothing is broken.
Read more: Manual Testing Only
Inverted Test Pyramid
When most tests are slow end-to-end tests and few are unit tests, defects in business logic go
undetected until integration testing. The E2E tests are too slow to run continuously, so they run
at the end. The hardening sprint is when the team finally discovers what was broken all along.
Read more: Inverted Test Pyramid
Undone Work
When the team’s definition of done does not include deployment and verification, stories are
marked complete while hidden work remains. Testing, validation, and integration happen after the
story is “done.” The hardening sprint is where all that undone work gets finished.
Read more: Undone Work
Monolithic Work Items
When features are built as large, indivisible units, integration risk accumulates silently. Each
large feature is developed in relative isolation for weeks. The hardening sprint is the first time
all the pieces come together, and the integration pain is proportional to the batch size.
Read more: Monolithic Work Items
Pressure to Skip Testing
When management pressures the team to maximize feature output, testing is deferred to “later.”
The hardening sprint is that “later.” Testing was not skipped; it was moved to the end where it is
less effective, more expensive, and blocks the release.
Read more: Pressure to Skip Testing
How to narrow it down
- Does the team have automated tests that run on every commit? If not, the hardening sprint
is compensating for the lack of continuous quality verification. Start with
Manual Testing Only.
- Are most automated tests end-to-end or UI tests? If the test suite is slow and top-heavy,
defects are caught late because fast unit tests are missing. Start with
Inverted Test Pyramid.
- Does the team’s definition of done include deployment and verification? If stories are
“done” before they are tested and deployed, the hardening sprint finishes what “done” should
have included. Start with
Undone Work.
- How large are the typical work items? If features take weeks and integrate at the end, the
batch size creates the integration risk. Start with
Monolithic Work Items.
- Is there pressure to prioritize features over testing? If testing is consistently deferred
to hit deadlines, the hardening sprint absorbs the cost. Start with
Pressure to Skip Testing.
2.4 - Releases Are Infrequent and Painful
Deployments happen monthly, quarterly, or less often. Each release is a large, risky event that requires war rooms and weekend work.
What you are seeing
The team deploys once a month, once a quarter, or on some irregular cadence that nobody can
predict. Each release is a significant event. There is a release planning meeting, a deployment
runbook, a designated release manager, and often a war room during the actual deploy. People
cancel plans for release weekends.
Between releases, changes pile up. By the time the release goes out, it contains dozens or
hundreds of changes from multiple developers. Nobody can confidently say what is in the release
without checking a spreadsheet or release notes document. When something breaks in production, the
team spends hours narrowing down which of the many changes caused the problem.
The team wants to release more often but feels trapped. Each release is so painful that adding
more releases feels like adding more pain.
Common causes
Manual Deployments
When deployment requires a human to execute steps (SSH into servers, run scripts, click through a
console), the process is slow, error-prone, and dependent on specific people being available. The
cost of each deployment is high enough that the team batches changes to amortize it. The batch
grows, the risk grows, and the release becomes an event rather than a routine.
Read more: Manual Deployments
Missing Deployment Pipeline
When there is no automated path from commit to production, every release requires manual
coordination of builds, tests, and deployments. Without a pipeline, the team cannot deploy on
demand because the process itself does not exist in a repeatable form.
Read more: Missing Deployment Pipeline
CAB Gates
When every production change requires committee approval, the approval cadence sets the release
cadence. If the Change Advisory Board meets weekly, releases happen weekly at best. If the meeting
is biweekly, releases are biweekly. The team cannot deploy faster than the approval process
allows, regardless of technical capability.
Read more: CAB Gates
Monolithic Work Items
When work is not decomposed into small, independently deployable increments, each “feature” is a
large batch of changes that takes weeks to complete. The team cannot release until the feature is
done, and the feature is never done quickly because it was scoped too large. Small batches enable
frequent releases. Large batches force infrequent ones.
Read more: Monolithic Work Items
Manual Regression Testing Gates
When every release requires a manual test pass that takes days or weeks, the testing cadence
limits the release cadence. The team cannot release until QA finishes, and QA cannot finish faster
because the test suite is manual and grows with every feature.
Read more: Manual Regression Testing Gates
How to narrow it down
- Is the deployment process automated? If deploying requires human steps beyond pressing a
button, the process itself is the bottleneck. Start with
Manual Deployments.
- Does a pipeline exist that can take code from commit to production? If not, the team cannot
release on demand because the infrastructure does not exist. Start with
Missing Deployment Pipeline.
- Does a committee or approval board gate production changes? If releases wait for scheduled
approval meetings, the approval cadence is the constraint. Start with
CAB Gates.
- How large is the typical work item? If features take weeks and are delivered as single
units, the batch size is the constraint. Start with
Monolithic Work Items.
- Does a manual test pass gate every release? If QA takes days per release, the testing
process is the constraint. Start with
Manual Regression Testing Gates.
2.5 - Merge Freezes Before Deployments
Developers announce merge freezes because the integration process is fragile. Deploying requires coordination in chat.
What you are seeing
A message appears in the team chat: “Please don’t merge to main, I’m about to deploy.” The
deployment process requires the main branch to be stable and unchanged for the duration of the
deploy. Any merge during that window could invalidate the tested artifact, break the build, or
create an inconsistent state between what was tested and what ships.
Other developers queue up their PRs and wait. If the deployment hits a problem, the freeze
extends. Sometimes the freeze lasts hours. In the worst cases, the team informally agrees on
“deployment windows” where merging is allowed at certain times and deployments happen at others.
The merge freeze is a coordination tax. Every deployment interrupts the entire team’s workflow.
Developers learn to time their merges around deploy schedules, adding mental overhead to routine
work.
Common causes
Manual Deployments
When deployment is a manual process (running scripts, clicking through UIs, executing a runbook),
the person deploying needs the environment to hold still. Any change to main during the deployment
window could mean the deployed artifact does not match what was tested. Automated deployments that
build, test, and deploy atomically eliminate this window because the pipeline handles the full
sequence without requiring a stable pause.
Read more: Manual Deployments
Integration Deferred
When the team does not have a reliable CI process, merging to main is itself risky. If the build
breaks after a merge, the deployment is blocked. The team freezes merges not just to protect the
deployment but because they lack confidence that any given merge will keep main green. If CI were
reliable, merging and deploying could happen concurrently because main would always be deployable.
Read more: Integration Deferred
Missing Deployment Pipeline
When there is no pipeline that takes a specific commit through build, test, and deploy as a single
atomic operation, the team must manually coordinate which commit gets deployed. A pipeline pins
the deployment to a specific artifact built from a specific commit. Without it, the team must
freeze merges to prevent the target from moving while they deploy.
Read more: Missing Deployment Pipeline
How to narrow it down
- Is the deployment process automated end-to-end? If a human executes deployment steps, the
freeze protects against variance in the manual process. Start with
Manual Deployments.
- Does the team trust that main is always deployable? If merges to main sometimes break the
build, the freeze protects against unreliable integration. Start with
Integration Deferred.
- Does the pipeline deploy a specific artifact from a specific commit? If there is no
pipeline that pins the deployment to an immutable artifact, the team must manually ensure the
target does not move. Start with
Missing Deployment Pipeline.
2.6 - Staging Passes but Production Fails
Deployments pass every pre-production check but break when they reach production.
What you are seeing
Code passes tests, QA signs off, staging looks fine. Then the release
hits production and something breaks: a feature behaves differently, a dependent service times
out, or data that never appeared in staging triggers an unhandled edge case.
The team scrambles to roll back or hotfix. Confidence in the pipeline drops. People start adding
more manual verification steps, which slows delivery without actually preventing the next
surprise.
Common causes
Snowflake Environments
When each environment is configured by hand (or was set up once and has drifted since), staging
and production are never truly the same. Different library versions, different environment
variables, different network configurations. Code that works in one context silently fails in
another because the environments are only superficially similar.
Read more: Snowflake Environments
Blind Operations
Sometimes the problem is not that staging passes and production fails. It is that production
failures go undetected until a customer reports them. Without monitoring and alerting, the team
has no way to verify production health after a deploy. “It works in staging” becomes the only
signal, and production problems surface hours or days late.
Read more: Blind Operations
Tightly Coupled Monolith
Hidden dependencies between components mean that a change in one area affects behavior in
another. In staging, these interactions may behave differently because the data is smaller, the
load is lighter, or a dependent service is stubbed. In production, the full weight of real usage
exposes coupling the team did not know existed.
Read more: Tightly Coupled Monolith
Manual Deployments
When deployment involves human steps (running scripts by hand, clicking through a console,
copying files), the process is never identical twice. A step skipped in staging, an extra
configuration applied in production, a different order of operations. The deployment itself
becomes a source of variance between environments.
Read more: Manual Deployments
How to narrow it down
- Are your environments provisioned from the same infrastructure code? If not, or if you
are not sure, start with Snowflake Environments.
- How did you discover the production failure? If a customer or support team reported it
rather than an automated alert, start with
Blind Operations.
- Does the failure involve a different service or module than the one you changed? If yes,
the issue is likely hidden coupling. Start with
Tightly Coupled Monolith.
- Is the deployment process identical and automated across all environments? If not, start
with Manual Deployments.
3 - Integration and Feedback Problems
Symptoms related to work-in-progress, integration pain, review bottlenecks, and feedback speed.
These symptoms indicate problems with how work flows through your team. When integration is
deferred, feedback is slow, or work piles up, the team stays busy without finishing things.
Each page describes what you are seeing and links to the anti-patterns most likely causing it.
How to use this section
Start with the symptom that matches what your team experiences. Each symptom page explains what
you are seeing, identifies the most likely root causes (anti-patterns), and provides diagnostic
questions to narrow down which cause applies to your situation. Follow the anti-pattern link to
find concrete fix steps.
Related anti-pattern categories: Team Workflow Anti-Patterns,
Branching and Integration Anti-Patterns
Related guides: Trunk-Based Development,
Work Decomposition,
Limiting WIP
3.1 - Everything Started, Nothing Finished
The board shows many items in progress but few reaching done. The team is busy but not delivering.
What you are seeing
Open the team’s board on any given day. Count the items in progress. Count the team members. If
the first number is significantly higher than the second, the team has a WIP problem. Every
developer is working on a different story. Eight items in progress, zero done. Nothing gets the
focused attention needed to finish.
At the end of the sprint, there is a scramble to close anything. Stories that were “almost done”
for days finally get pushed through. Cycle time is long and unpredictable. The team is busy all
the time but finishes very little.
Common causes
Push-Based Work Assignment
When managers assign work to individuals rather than letting the team pull from a prioritized
backlog, each person ends up with their own queue of assigned items. WIP grows because work is
distributed across individuals rather than flowing through the team. Nobody swarms on blocked
items because everyone is busy with “their” assigned work.
Read more: Push-Based Work Assignment
Horizontal Slicing
When work is split by technical layer (“build the database schema,” “build the API,” “build the
UI”), each layer must be completed before anything is deployable. Multiple developers work on
different layers of the same feature simultaneously, all “in progress,” none independently done.
WIP is high because the decomposition prevents any single item from reaching completion quickly.
Read more: Horizontal Slicing
Unbounded WIP
When the team has no explicit constraint on how many items can be in progress simultaneously,
there is nothing to prevent WIP from growing. Developers start new work whenever they are
blocked, waiting for review, or between tasks. Without a limit, the natural tendency is to stay
busy by starting things rather than finishing them.
Read more: Unbounded WIP
How to narrow it down
- Does each developer have their own assigned backlog of work? If yes, the assignment model
prevents swarming and drives individual queues. Start with
Push-Based Work Assignment.
- Are work items split by technical layer rather than by user-visible behavior? If yes,
items cannot be completed independently. Start with
Horizontal Slicing.
- Is there any explicit limit on how many items can be in progress at once? If no, the team
has no mechanism to stop starting and start finishing. Start with
Unbounded WIP.
3.2 - Feedback Takes Hours Instead of Minutes
The time from making a change to knowing whether it works is measured in hours, not minutes. Developers batch changes to avoid waiting.
What you are seeing
A developer makes a change and wants to know if it works. They push to CI and wait 45 minutes for
the pipeline. Or they open a PR and wait two days for a review. Or they deploy to staging and wait
for a manual QA pass that happens next week. By the time feedback arrives, the developer has moved
on to something else.
The slow feedback changes developer behavior. They batch multiple changes into a single commit to
avoid waiting multiple times. They skip local verification and push larger, less certain changes.
They start new work before the previous change is validated, juggling multiple incomplete tasks.
When feedback finally arrives and something is wrong, the developer must context-switch back. The
mental model from the original change has faded. Debugging takes longer because the developer is
working from memory rather than from active context. If multiple changes were batched, the
developer must untangle which one caused the failure.
Common causes
Inverted Test Pyramid
When most tests are slow E2E tests, the test feedback loop is measured in tens of minutes rather
than seconds. Unit tests provide feedback in seconds. E2E tests take minutes or hours. A team with
a fast unit test suite can verify a change in under a minute. A team whose testing relies on E2E
tests cannot get feedback faster than those tests can run.
Read more: Inverted Test Pyramid
Integration Deferred
When the team does not integrate frequently (at least daily), the feedback loop for integration
problems is as long as the branch lifetime. A developer working on a two-week branch does not
discover integration conflicts until they merge. Daily integration catches conflicts within hours.
Continuous integration catches them within minutes.
Read more: Integration Deferred
Manual Testing Only
When there are no automated tests, the only feedback comes from manual verification. A developer
makes a change and must either test it manually themselves (slow) or wait for someone else to test
it (slower). Automated tests provide feedback in the pipeline without requiring human effort or
scheduling.
Read more: Manual Testing Only
Long-Lived Feature Branches
When pull requests wait days for review, the code review feedback loop dominates total cycle time.
A developer finishes a change in two hours, then waits two days for review. The review feedback
loop is 24 times longer than the development time. Long-lived branches produce large PRs, and
large PRs take longer to review. Fast feedback requires fast reviews, which requires small PRs,
which requires short-lived branches.
Read more: Long-Lived Feature Branches
Manual Regression Testing Gates
When every change must pass through a manual QA gate, the feedback loop includes human scheduling.
The QA team has a queue. The change waits in line. When the tester gets to it, days have passed.
Automated testing in the pipeline replaces this queue with instant feedback.
Read more: Manual Regression Testing Gates
How to narrow it down
- How fast can the developer verify a change locally? If the local test suite takes more than
a few minutes, the test strategy is the bottleneck. Start with
Inverted Test Pyramid.
- How frequently does the team integrate to main? If developers work on branches for days
before integrating, the integration feedback loop is the bottleneck. Start with
Integration Deferred.
- Are there automated tests at all? If the only feedback is manual testing, the lack of
automation is the bottleneck. Start with
Manual Testing Only.
- How long do PRs wait for review? If review turnaround is measured in days, the review
process is the bottleneck. Start with
Long-Lived Feature Branches.
- Is there a manual QA gate in the pipeline? If changes wait in a QA queue, the manual gate
is the bottleneck. Start with
Manual Regression Testing Gates.
3.3 - Merging Is Painful and Time-Consuming
Integration is a dreaded, multi-day event. Teams delay merging because it is painful, which makes the next merge even worse.
What you are seeing
A developer has been working on a feature branch for two weeks. They open a pull request and
discover dozens of conflicts across multiple files. Other developers have changed the same areas
of the codebase. Resolving the conflicts takes a full day. Some conflicts are straightforward
(two people edited adjacent lines), but others are semantic (two people changed the same
function’s behavior in different ways). The developer must understand both changes to merge
correctly.
After resolving conflicts, the tests fail. The merged code compiles but does not work because the
two changes are logically incompatible. The developer spends another half-day debugging the
interaction. By the time the branch is merged, the developer has spent more time integrating than
they spent building the feature.
The team knows merging is painful, so they delay it. The delay makes the next merge worse because
more code has diverged. The cycle repeats until someone declares a “merge day” and the team spends
an entire day resolving accumulated drift.
Common causes
Long-Lived Feature Branches
When branches live for weeks or months, they accumulate divergence from the main line. The longer
the branch lives, the more changes happen on main that the branch does not include. At merge time,
all of that divergence must be reconciled at once. A branch that is one day old has almost no
conflicts. A branch that is two weeks old may have dozens.
Read more: Long-Lived Feature Branches
Integration Deferred
When the team does not practice continuous integration (integrating to main at least daily), each
developer’s work diverges independently. The build may be green on each branch but broken when
branches combine. CI means integrating continuously, not running a build server. Without frequent
integration, merge pain is inevitable.
Read more: Integration Deferred
Monolithic Work Items
When work items are too large to complete in a day or two, developers must stay on a branch for
the duration. A story that takes a week forces a week-long branch. Breaking work into smaller
increments that can be integrated daily eliminates the divergence window that causes painful
merges.
Read more: Monolithic Work Items
How to narrow it down
- How long do branches typically live before merging? If branches live longer than two days,
the branch lifetime is the primary driver of merge pain. Start with
Long-Lived Feature Branches.
- Does the team integrate to main at least once per day? If developers work in isolation for
days before integrating, they are not practicing continuous integration regardless of whether a
CI server exists. Start with
Integration Deferred.
- How large are the typical work items? If stories take a week or more, the work
decomposition forces long branches. Start with
Monolithic Work Items.
3.4 - Pull Requests Sit for Days Waiting for Review
Pull requests queue up and wait. Authors have moved on by the time feedback arrives.
What you are seeing
A developer opens a pull request and waits. Hours pass. A day passes. They ping someone in chat.
Eventually, comments arrive, but the author has moved on to something else and has to reload
context to respond. Another round of comments. Another wait. The PR finally merges two or three
days after it was opened.
The team has five or more open PRs at any time. Some are days old. Developers start new work
while they wait, which creates more PRs, which creates more review load, which slows reviews
further.
Common causes
Long-Lived Feature Branches
When developers work on branches for days, the resulting PRs are large. Large PRs take longer to
review because reviewers need more time to understand the scope of the change. A 300-line PR is
daunting. A 50-line PR takes 10 minutes. The branch length drives the PR size, which drives the
review delay.
Read more: Long-Lived Feature Branches
Knowledge Silos
When only specific individuals can review certain areas of the codebase, those individuals become
bottlenecks. Their review queue grows while other team members who could review are not
considered qualified. The constraint is not review capacity in general but review capacity for
specific code areas concentrated in too few people.
Read more: Knowledge Silos
Push-Based Work Assignment
When work is assigned to individuals, reviewing someone else’s code feels like a distraction
from “my work.” Every developer has their own assigned stories to protect. Helping a teammate
finish their work by reviewing their PR competes with the developer’s own assignments. The
incentive structure deprioritizes collaboration.
Read more: Push-Based Work Assignment
How to narrow it down
- Are PRs larger than 200 lines on average? If yes, the reviews are slow because the
changes are too large to review quickly. Start with
Long-Lived Feature Branches
and the work decomposition that feeds them.
- Are reviews waiting on specific individuals? If most PRs are assigned to or waiting on
one or two people, the team has a knowledge bottleneck. Start with
Knowledge Silos.
- Do developers treat review as lower priority than their own coding work? If yes, the
team’s norms do not treat review as a first-class activity. Start with
Push-Based Work Assignment and
establish a team working agreement that reviews happen before starting new work.
3.5 - Pipelines Take Too Long
CI/CD pipelines take 30 minutes or more. Developers stop waiting and lose the feedback loop.
What you are seeing
A developer pushes a commit and waits. Thirty minutes pass. An hour. The pipeline is still
running. The developer context-switches to another task, and by the time the pipeline finishes
(or fails), they have moved on mentally. If the build fails, they must reload context, figure out
what went wrong, fix it, push again, and wait another 30 minutes.
Developers stop running the full test suite locally because it takes too long. They push and hope.
Some developers batch multiple changes into a single push to avoid waiting multiple times, which
makes failures harder to diagnose. Others skip the pipeline entirely for small changes and merge
with only local verification.
The pipeline was supposed to provide fast feedback. Instead, it provides slow feedback that
developers work around rather than rely on.
Common causes
Inverted Test Pyramid
When most of the test suite consists of end-to-end or integration tests rather than unit tests,
the pipeline is dominated by slow, resource-intensive test execution. E2E tests launch browsers,
spin up services, and wait for network responses. A test suite with thousands of unit tests (that
run in seconds) and a small number of targeted E2E tests is fast. A suite with hundreds of E2E
tests and few unit tests is slow by construction.
Read more: Inverted Test Pyramid
Snowflake Environments
When pipeline environments are not standardized or reproducible, builds include extra time for
environment setup, dependency installation, and configuration. Caching is unreliable because the
environment state is unpredictable. A pipeline that spends 15 minutes downloading dependencies
because there is no reliable cache layer is slow for infrastructure reasons, not test reasons.
Read more: Snowflake Environments
Tightly Coupled Monolith
When the codebase has no clear module boundaries, every change triggers a full rebuild and a full
test run. The pipeline cannot selectively build or test only the affected components because the
dependency graph is tangled. A change to one module might affect any other module, so the pipeline
must verify everything.
Read more: Tightly Coupled Monolith
Manual Regression Testing Gates
When the pipeline includes a manual testing phase, the wall-clock time from push to green
includes human wait time. A pipeline that takes 10 minutes to build and test but then waits two
days for manual sign-off is not a 10-minute pipeline. It is a two-day pipeline with a 10-minute
automated prefix.
Read more: Manual Regression Testing Gates
How to narrow it down
- What percentage of pipeline time is spent running tests? If test execution dominates and
most tests are E2E or integration tests, the test strategy is the bottleneck. Start with
Inverted Test Pyramid.
- How much time is spent on environment setup and dependency installation? If the pipeline
spends significant time on infrastructure before any tests run, the build environment is the
bottleneck. Start with
Snowflake Environments.
- Can the pipeline build and test only the changed components? If every change triggers a
full rebuild, the architecture prevents selective testing. Start with
Tightly Coupled Monolith.
- Does the pipeline include any manual steps? If a human must approve or act before the
pipeline completes, the human is the bottleneck. Start with
Manual Regression Testing Gates.
3.6 - Work Items Take Days or Weeks to Complete
Stories regularly take more than a week from start to done. Developers go days without integrating.
What you are seeing
A developer picks up a work item on Monday. By Wednesday, they are still working on it. By
Friday, it is “almost done.” The following Monday, they are fixing edge cases. The item finally
moves to review mid-week as a 300-line pull request that the reviewer does not have time to look
at carefully.
Cycle time is measured in weeks, not days. The team commits to work at the start of the sprint
and scrambles at the end. Estimates are off by a factor of two because large items hide unknowns
that only surface mid-implementation.
Common causes
Horizontal Slicing
When work is split by technical layer rather than by user-visible behavior, each item spans an
entire layer and takes days to complete. “Build the database schema,” “build the API,” “build the
UI” are each multi-day items. Nothing is deployable until all layers are done. Vertical slicing
(cutting thin slices through all layers to deliver complete functionality) produces items that
can be finished in one to two days.
Read more: Horizontal Slicing
Monolithic Work Items
When the team takes requirements as they arrive without breaking them into smaller pieces, work
items are as large as the feature they describe. A ticket titled “Add user profile page” hides
a login form, avatar upload, email verification, notification preferences, and password reset.
Without a decomposition practice during refinement, items arrive at planning already too large
to flow.
Read more: Monolithic Work Items
Long-Lived Feature Branches
When developers work on branches for days or weeks, the branch and the work item are the same
size: large. The branching model reinforces large items because there is no integration pressure
to finish quickly. Trunk-based development creates natural pressure to keep items small enough to
integrate daily.
Read more: Long-Lived Feature Branches
How to narrow it down
- Are work items split by technical layer? If the board shows items like “backend for
feature X” and “frontend for feature X,” the decomposition is horizontal. Start with
Horizontal Slicing.
- Do items arrive at planning without being broken down? If items go from “product owner
describes a feature” to “developer starts coding” without a decomposition step, start with
Monolithic Work Items.
- Do developers work on branches for more than a day? If yes, the branching model allows
and encourages large items. Start with
Long-Lived Feature Branches.
4 - Production Visibility and Team Health
Symptoms related to production observability, incident detection, environment parity, and team sustainability.
These symptoms indicate problems with how your team sees and responds to production issues.
When problems are invisible until customers report them, or when the team is burning out from
process overhead, the delivery system is working against the people in it. Each page describes
what you are seeing and links to the anti-patterns most likely causing it.
How to use this section
Start with the symptom that matches what your team experiences. Each symptom page explains what
you are seeing, identifies the most likely root causes (anti-patterns), and provides diagnostic
questions to narrow down which cause applies to your situation. Follow the anti-pattern link to
find concrete fix steps.
Related anti-pattern categories: Monitoring and Observability Anti-Patterns,
Organizational and Cultural Anti-Patterns
Related guides: Progressive Rollout,
Working Agreements,
Metrics-Driven Improvement
4.1 - Team Burnout and Unsustainable Pace
The team is exhausted. Every sprint is a crunch sprint. There is no time for learning, improvement, or recovery.
What you are seeing
The team is always behind. Sprint commitments are missed or met only through overtime. Developers
work evenings and weekends to hit deadlines, then start the next sprint already tired. There is no
buffer for unplanned work, so every production incident or stakeholder escalation blows up the
plan.
Nobody has time for learning, experimentation, or process improvement. Suggestions like “let’s
improve our test suite” or “let’s automate that deployment” are met with “we don’t have time.”
The irony is that the manual work those improvements would eliminate is part of what keeps the
team too busy.
Attrition risk is high. The most experienced developers leave first because they have options.
Their departure increases the load on whoever remains, accelerating the cycle.
Common causes
Thin-Spread Teams
When a small team owns too many products, every developer is stretched across multiple codebases.
Context switching consumes 20 to 40 percent of their capacity. The team looks fully utilized but
delivers less than a focused team half its size. The utilization trap (“keep everyone busy”) masks
the real problem: the team has more responsibilities than it can sustain.
Read more: Thin-Spread Teams
Deadline-Driven Development
When every sprint is driven by an arbitrary deadline, the team never operates at a sustainable
pace. There is no recovery period after a crunch because the next deadline starts immediately.
Quality is the first casualty, which creates rework, which consumes future capacity, which makes
the next deadline even harder to meet. The cycle accelerates until the team collapses.
Read more: Deadline-Driven Development
Unbounded WIP
When there is no limit on work in progress, the team starts many things and finishes few. Every
developer juggles multiple items, each getting fragmented attention. The sensation of being
constantly busy but never finishing anything is a direct contributor to burnout. The team is
working hard on everything and completing nothing.
Read more: Unbounded WIP
Velocity as Individual Metric
When individual story points are tracked, developers cannot afford to help each other, take time
to learn, or invest in quality. Every hour must produce measurable output. The pressure to perform
individually eliminates the slack that teams need to stay healthy. Helping a teammate, mentoring
a junior developer, or improving a build script all become career risks because they do not
produce points.
Read more: Velocity as Individual Metric
How to narrow it down
- Is the team responsible for more products than it can sustain? If developers are spread
across many products with constant context switching, the workload exceeds what the team
structure can handle. Start with
Thin-Spread Teams.
- Is every sprint driven by an external deadline? If the team has not had a sprint without
deadline pressure in months, the pace is unsustainable by design. Start with
Deadline-Driven Development.
- Does the team have more items in progress than team members? If WIP is unbounded and
developers juggle multiple items, the team is thrashing rather than delivering. Start with
Unbounded WIP.
- Are individuals measured by story points or velocity? If developers feel pressure to
maximize personal output at the expense of collaboration and sustainability, the measurement
system is contributing to burnout. Start with
Velocity as Individual Metric.
4.2 - Production Issues Discovered by Customers
The team finds out about production problems from support tickets, not alerts.
What you are seeing
The team deploys a change. Someone asks “is it working?” Nobody knows. There is no dashboard to
check. There are no metrics to compare before and after. The team waits. If nobody complains
within an hour, they assume the deployment was successful.
When something does go wrong, the team finds out from a customer support ticket, a Slack message
from another team, or an executive asking why the site is slow. The investigation starts with
SSH-ing into a server and reading raw log files. Hours pass before anyone understands what
happened, what caused it, or how many users were affected.
Common causes
Blind Operations
The team has no application-level metrics, no centralized logging, and no alerting. The
infrastructure may report that servers are running, but nobody can tell whether the application
is actually working correctly. Without instrumentation, the only way to discover a problem is to
wait for someone to experience it and report it.
Read more: Blind Operations
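Instrumentation does not have to start big. Here is a minimal sketch using the Prometheus Python client; the metric names and the simulated traffic are illustrative. With even a request counter and a latency histogram scraped into a dashboard, "is it working after the deploy?" has an answer.

```python
# Minimal application-level instrumentation with the Prometheus Python
# client (pip install prometheus-client). Metric names are illustrative.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency")


def handle_request() -> None:
    with LATENCY.time():                        # records request duration
        time.sleep(random.uniform(0.01, 0.2))   # stand-in for real work
        status = "500" if random.random() < 0.02 else "200"
    REQUESTS.labels(status=status).inc()


if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for the monitoring system to scrape
    while True:               # stand-in for real traffic
        handle_request()
```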
Manual Deployments
When deployments involve human steps (running scripts by hand, clicking through a console),
there is no automated verification step. The deployment process ends when the human finishes the
steps, not when the system confirms it is healthy. Without an automated pipeline that checks
health metrics after deploying, verification falls to manual spot-checking or waiting for
complaints.
Read more: Manual Deployments
Missing Deployment Pipeline
When there is no automated path from commit to production, there is nowhere to integrate
automated health checks. A deployment pipeline can include post-deploy verification that
compares metrics before and after. Without a pipeline, verification is entirely manual and
usually skipped under time pressure.
Read more: Missing Deployment Pipeline
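As an illustration, a verification stage can snapshot an error-rate metric before deploying and fail the stage if the rate worsens afterward, which gives the pipeline something concrete to roll back on. The metrics URL, query, deploy command, and thresholds below are hypothetical placeholders for whatever your monitoring and deployment tooling actually expose.

```python
"""Post-deploy verification sketch: compare an error-rate metric before and
after deploying. URL, query, deploy command, and thresholds are hypothetical."""
import subprocess
import sys
import time

import requests

METRICS_URL = "https://metrics.example.com/api/query"   # hypothetical endpoint
QUERY = "rate(http_requests_total{status=~'5..'}[5m])"  # Prometheus-style query


def error_rate() -> float:
    resp = requests.get(METRICS_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    return float(resp.json()["value"])   # response shape depends on your backend


def main() -> int:
    baseline = error_rate()
    subprocess.run(["./deploy.sh"], check=True)   # placeholder deploy step
    time.sleep(300)                               # let post-deploy traffic accumulate
    current = error_rate()
    # Fail if the error rate grew by 50% and is above a small absolute floor.
    if current > baseline * 1.5 and current > 0.01:
        print(f"Error rate rose from {baseline:.4f} to {current:.4f}; failing stage.")
        return 1   # non-zero exit lets the pipeline halt or roll back
    print(f"Error rate OK ({baseline:.4f} -> {current:.4f}).")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```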
How to narrow it down
- Does the team have application-level metrics and alerts? If no, the team has no way to
detect problems automatically. Start with
Blind Operations.
- Is the deployment process automated with health checks? If deployments are manual or
automated without post-deploy verification, problems go undetected until users report them.
Start with Manual Deployments or
Missing Deployment Pipeline.
- Does the team check a dashboard after every deployment? If the answer is “sometimes” or
“we click through the app manually,” the verification step is unreliable. Start with
Blind Operations to build
automated verification.
4.3 - Production Problems Are Discovered Hours or Days Late
Issues in production are not discovered until users report them. There is no automated detection or alerting.
What you are seeing
A deployment goes out on Tuesday. On Thursday, a support ticket comes in: a feature is broken for
a subset of users. The team investigates and discovers the problem was introduced in Tuesday’s
deploy. For two days, users experienced the issue while the team had no idea.
Or a performance degradation appears gradually. Response times creep up over a week. Nobody
notices until a customer complains or a business metric drops. The team checks the dashboards and
sees the degradation started after a specific deploy, but the deploy was days ago and the trail is
cold.
The team deploys carefully and then “watches for a while.” Watching means checking a few URLs
manually or refreshing a dashboard for 15 minutes. If nothing obviously breaks in that window, the
deployment is declared successful. Problems that manifest slowly, affect a subset of users, or
appear under specific conditions go undetected.
Common causes
Blind Operations
When the team has no monitoring, no alerting, and no aggregated logging, production is a black
box. The only signal that something is wrong comes from users, support staff, or business reports.
The team cannot detect problems because they have no instruments to detect them with. Adding
observability (metrics, structured logging, distributed tracing, alerting) gives the team eyes on
production.
Read more: Blind Operations
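Structured logging is often the cheapest place to start, because it makes the log lines the team already writes searchable and aggregatable by field. A minimal sketch using only the Python standard library; the field names are illustrative.

```python
# Structured (JSON) logging with only the standard library, so log lines can
# be shipped to an aggregator and queried by field. Field names are illustrative.
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become queryable attributes.
        for key in ("request_id", "user_id", "route"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

log = logging.getLogger("checkout")
log.info("payment captured", extra={"request_id": "req-123", "user_id": 42, "route": "/checkout"})
```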
Undone Work
When the team’s definition of done does not include post-deployment verification, nobody is
responsible for confirming that the deployment is healthy. The story is “done” when the code is
merged or deployed, not when it is verified in production. Health checks, smoke tests, and canary
analysis are not part of the workflow because the workflow ends before production.
Read more: Undone Work
Manual Deployments
When deployments are manual, there is no automated post-deploy verification step. An automated
pipeline can include health checks, smoke tests, and rollback triggers as part of the deployment
sequence. A manual deployment ends when the human finishes the runbook. Whether the deployment is
actually healthy is a separate question that may or may not get answered.
Read more: Manual Deployments
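A smoke test does not need to be elaborate to work as an automated gate; it only needs to fail loudly when a core flow breaks. A sketch with hypothetical URLs, written so a non-zero exit code can halt or roll back the pipeline:

```python
"""Post-deploy smoke test sketch: hit a few critical endpoints and exit
non-zero on failure so an automated pipeline can halt or roll back.
URLs are hypothetical."""
import sys

import requests

CHECKS = [
    ("health endpoint", "https://app.example.com/healthz"),
    ("login page", "https://app.example.com/login"),
    ("search API", "https://app.example.com/api/search?q=smoke"),
]


def check(name: str, url: str) -> bool:
    try:
        resp = requests.get(url, timeout=5)
        if resp.status_code == 200:
            print(f"ok   {name}")
            return True
        print(f"FAIL {name}: HTTP {resp.status_code}")
    except requests.RequestException as exc:
        print(f"FAIL {name}: {exc}")
    return False


def main() -> int:
    results = [check(name, url) for name, url in CHECKS]
    return 0 if all(results) else 1   # non-zero exit fails the pipeline stage


if __name__ == "__main__":
    sys.exit(main())
```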
How to narrow it down
- Does the team have production monitoring with alerting thresholds? If not, the team cannot
detect problems that users do not report. Start with
Blind Operations.
- Does the team’s definition of done include post-deploy verification? If stories are closed
before production health is confirmed, nobody owns the detection step. Start with
Undone Work.
- Does the deployment process include automated health checks? If deployments end when the
human finishes the script, there is no automated verification. Start with
Manual Deployments.
4.4 - It Works on My Machine
Code that works in one developer’s environment fails in another, in CI, or in production. Environment differences make results unreproducible.
What you are seeing
A developer runs the application locally and everything works. They push to CI and the build
fails. Or a teammate pulls the same branch and gets a different result. Or a bug report comes in
that nobody can reproduce locally.
The team spends hours debugging only to discover the issue is environmental: a different Node
version, a missing system library, a different database encoding, or a service running on the
developer’s machine that is not available in CI. The code is correct. The environments are
different.
New team members experience this acutely. Setting up a development environment takes days of
following an outdated wiki page, asking teammates for help, and discovering undocumented
dependencies. Every developer’s machine accumulates unique configuration over time, making “works
on my machine” a common refrain and a useless debugging signal.
Common causes
Snowflake Environments
When development environments are set up manually and maintained individually, each developer’s
machine becomes unique. One developer runs Python 3.9, another 3.11. One has PostgreSQL 14,
another 15. These differences are invisible until someone hits a version-specific behavior.
Reproducible, containerized development environments eliminate the variance by ensuring every
developer works in an identical setup.
Read more: Snowflake Environments
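Containerized, declaratively defined environments are the durable fix. As a lighter-weight stopgap rather than a substitute, a preflight script can at least make drift visible by comparing key versions against what the team has pinned. A sketch, with a hypothetical expected-versions list:

```python
"""Preflight drift check sketch: surface environment drift explicitly instead
of discovering it mid-debug. Pinned versions are hypothetical; a containerized
dev environment is the real fix, this only makes drift visible."""
import platform
import sys
from importlib.metadata import PackageNotFoundError, version

EXPECTED = {
    "python": "3.11",   # major.minor the team has agreed on
    "requests": "2.31",  # example pinned packages (hypothetical)
    "psycopg2-binary": "2.9",
}


def main() -> int:
    problems = []
    py = f"{sys.version_info.major}.{sys.version_info.minor}"
    if py != EXPECTED["python"]:
        problems.append(f"python {py} != expected {EXPECTED['python']}")
    for pkg, want in EXPECTED.items():
        if pkg == "python":
            continue
        try:
            got = version(pkg)
        except PackageNotFoundError:
            problems.append(f"{pkg} not installed (expected {want}.x)")
            continue
        if not got.startswith(want):
            problems.append(f"{pkg} {got} != expected {want}.x")
    for p in problems:
        print("DRIFT:", p)
    print(f"platform: {platform.platform()}")   # useful context in bug reports
    return 1 if problems else 0


if __name__ == "__main__":
    sys.exit(main())
```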
Manual Deployments
When environment setup is a manual process documented in a wiki or README, it is never followed
identically. Each developer interprets the instructions slightly differently, installs a slightly
different version, or skips a step that seems optional. The manual process guarantees divergence
over time. Infrastructure as code and automated setup scripts ensure consistency.
Read more: Manual Deployments
Tightly Coupled Monolith
When the application has implicit dependencies on its environment (specific file paths, locally
running services, system-level configuration), it is inherently sensitive to environmental
differences. Well-designed code with explicit, declared dependencies works the same way
everywhere. Code that reaches into its runtime environment for undeclared dependencies works only
where those dependencies happen to exist.
Read more: Tightly Coupled Monolith
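The difference shows up directly in code. In the sketch below (module and configuration names are hypothetical), the first pair of functions silently assumes things about the machine they run on, while the second pair declares those needs so every environment must supply them explicitly.

```python
# Implicit vs. explicit environmental dependencies. Paths, hosts, and
# addresses are hypothetical.
import json
import smtplib


def load_settings_implicit() -> dict:
    # Works only on the machine where this exact path exists.
    with open("/Users/alice/projects/app/config.json") as f:
        return json.load(f)


def send_alert_implicit(message: str) -> None:
    # Assumes a mail relay runs locally: true on one laptop, false in CI.
    smtplib.SMTP("localhost", 25).sendmail("app@localhost", ["ops@example.com"], message)


def load_settings_explicit(config_path: str) -> dict:
    # The environment must supply the path; a missing value fails fast and visibly.
    with open(config_path) as f:
        return json.load(f)


def send_alert_explicit(message: str, smtp_host: str, smtp_port: int) -> None:
    # The mail relay is a declared dependency, injected per environment.
    smtplib.SMTP(smtp_host, smtp_port).sendmail("app@example.com", ["ops@example.com"], message)
```

The explicit versions are also easier to test, because a test can pass a temporary file and a fake SMTP host instead of depending on whatever happens to be installed on the machine running it.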
How to narrow it down
- Do all developers use the same OS, runtime versions, and dependency versions? If not,
environment divergence is the most likely cause. Start with
Snowflake Environments.
- Is the development environment setup automated or manual? If it is a wiki page that takes
a day to follow, the manual process creates the divergence. Start with
Manual Deployments.
- Does the application depend on local services, file paths, or system configuration that is
not declared in the codebase? If the application has implicit environmental dependencies,
it will behave differently wherever those dependencies differ. Start with
Tightly Coupled Monolith.
5 - Find Your Symptom
Answer a few questions to narrow down which dysfunction symptoms match your situation.
Find the category below that best describes what your team is experiencing, then follow the
sub-questions to the most relevant symptom pages.
We have problems with our tests
Tests pass sometimes and fail sometimes without code changes
Your tests are non-deterministic. This is often caused by environment differences or test
architecture that depends on external systems.
We have good coverage numbers but bugs still reach production
Coverage measures which lines execute, not whether the tests verify correct behavior. High
coverage with low defect detection points to a test design problem.
Refactoring is risky because it breaks tests
When tests are coupled to implementation details rather than behavior, any internal change
causes test failures even when the behavior is correct.
The test suite takes too long to run
Slow tests delay feedback and encourage developers to skip running them locally.
Deploying and releasing is painful
The team avoids or dreads deployments
When deployments frequently cause incidents, the team learns to treat them as high-risk events.
We need to coordinate multiple services or teams to deploy
Deployment coordination signals architectural coupling or process constraints.
We need a stabilization period before each release
If you need dedicated time to “harden” before releasing, the normal development process is not
producing releasable code.
Work is slow and things pile up
Lots of things are in progress but few are finishing
High work-in-progress means the team is spread thin. Nothing gets the focus needed to finish.
Merging and integrating code is difficult
When integration is deferred, branches diverge and merging becomes painful.
Feedback on changes takes too long
Slow feedback loops mean developers context-switch away and problems grow before they are caught.
Production problems and team health
Customers find problems before we do
If your monitoring does not catch issues before users report them, you have an observability gap.
Code behaves differently in different environments
Environment inconsistency makes it impossible to reproduce problems reliably.
The team is exhausted from process overhead
When the delivery process creates friction at every step, the team burns out.
6 - Symptoms for Developers
Dysfunction symptoms grouped by the friction developers and tech leads experience - from daily coding pain to team-level delivery patterns.
These are the symptoms you experience while writing, testing, and shipping code. Some you feel
personally. Others you see as patterns across the team. If something on this list sounds
familiar, follow the link to find what is causing it and how to fix it.
Pushing code and getting feedback
Tests getting in the way
- Tests Randomly Pass or Fail - You click rerun without investigating because flaky failures are so common. The team ignores failures by default, which masks real regressions.
- Refactoring Breaks Tests - You rename a method or restructure a class and 15 tests fail, even though the behavior is correct. Technical debt accumulates because cleanup is too expensive.
- Test Suite Is Too Slow to Run - Running tests locally is so slow that you skip it and push to CI instead, trading fast feedback for a longer loop.
- High Coverage but Tests Miss Defects - Coverage is above 80% but bugs still make it to production. The tests check that code runs, not that it works correctly.
Integrating and merging
Deploying and releasing
Environment and production surprises
7 - Symptoms for Managers
Dysfunction symptoms grouped by business impact - unpredictable delivery, quality, and team health.
These are the symptoms that show up in sprint reviews, quarterly planning, and 1-on-1s. They
manifest as missed commitments, quality problems, and retention risk.
Unpredictable delivery
Quality reaching customers
Coordination overhead
Team health and retention
What to do next
If these symptoms sound familiar, these resources can help you build a case for change and
find a starting point: