Production Visibility and Team Health
Symptoms related to production observability, incident detection, environment parity, and team sustainability.
These symptoms indicate problems with how your team sees and responds to production issues.
When problems are invisible until customers report them, or when the team is burning out from
process overhead, the delivery system is working against the people in it. Each page describes
what you are seeing and links to the anti-patterns most likely causing it.
How to use this section
Start with the symptom that matches what your team experiences. Each symptom page explains what
you are seeing, identifies the most likely root causes (anti-patterns), and provides diagnostic
questions to narrow down which cause applies to your situation. Follow the anti-pattern link to
find concrete fix steps.
Related anti-pattern categories: Monitoring and Observability Anti-Patterns,
Organizational and Cultural Anti-Patterns
Related guides: Progressive Rollout,
Working Agreements,
Metrics-Driven Improvement
1 - Team Burnout and Unsustainable Pace
The team is exhausted. Every sprint is a crunch sprint. There is no time for learning, improvement, or recovery.
What you are seeing
The team is always behind. Sprint commitments are missed or met only through overtime. Developers
work evenings and weekends to hit deadlines, then start the next sprint already tired. There is no
buffer for unplanned work, so every production incident or stakeholder escalation blows up the
plan.
Nobody has time for learning, experimentation, or process improvement. Suggestions like “let’s
improve our test suite” or “let’s automate that deployment” are met with “we don’t have time.”
The irony is that the manual work those improvements would eliminate is part of what keeps the
team too busy.
Attrition risk is high. The most experienced developers leave first because they have options.
Their departure increases the load on whoever remains, accelerating the cycle.
Common causes
Thin-Spread Teams
When a small team owns too many products, every developer is stretched across multiple codebases.
Context switching consumes 20 to 40 percent of their capacity. The team looks fully utilized but
delivers less than a focused team half its size. The utilization trap (“keep everyone busy”) masks
the real problem: the team has more responsibilities than it can sustain.
Read more: Thin-Spread Teams
Deadline-Driven Development
When every sprint is driven by an arbitrary deadline, the team never operates at a sustainable
pace. There is no recovery period after a crunch because the next deadline starts immediately.
Quality is the first casualty, which creates rework, which consumes future capacity, which makes
the next deadline even harder to meet. The cycle accelerates until the team collapses.
Read more: Deadline-Driven Development
Unbounded WIP
When there is no limit on work in progress, the team starts many things and finishes few. Every
developer juggles multiple items, each getting fragmented attention. The sensation of being
constantly busy but never finishing anything is a direct contributor to burnout. The team is
working hard on everything and completing nothing.
Read more: Unbounded WIP
Velocity as Individual Metric
When individual story points are tracked, developers cannot afford to help each other, take time
to learn, or invest in quality. Every hour must produce measurable output. The pressure to perform
individually eliminates the slack that teams need to stay healthy. Helping a teammate, mentoring
a junior developer, or improving a build script all become career risks because they do not
produce points.
Read more: Velocity as Individual Metric
How to narrow it down
- Is the team responsible for more products than it can sustain? If developers are spread
across many products with constant context switching, the workload exceeds what the team
structure can handle. Start with
Thin-Spread Teams.
- Is every sprint driven by an external deadline? If the team has not had a sprint without
deadline pressure in months, the pace is unsustainable by design. Start with
Deadline-Driven Development.
- Does the team have more items in progress than team members? If WIP is unbounded and
developers juggle multiple items, the team is thrashing rather than delivering. Start with
Unbounded WIP.
- Are individuals measured by story points or velocity? If developers feel pressure to
maximize personal output at the expense of collaboration and sustainability, the measurement
system is contributing to burnout. Start with
Velocity as Individual Metric.
2 - Production Issues Discovered by Customers
The team finds out about production problems from support tickets, not alerts.
What you are seeing
The team deploys a change. Someone asks “is it working?” Nobody knows. There is no dashboard to
check. There are no metrics to compare before and after. The team waits. If nobody complains
within an hour, they assume the deployment was successful.
When something does go wrong, the team finds out from a customer support ticket, a Slack message
from another team, or an executive asking why the site is slow. The investigation starts with
SSH-ing into a server and reading raw log files. Hours pass before anyone understands what
happened, what caused it, or how many users were affected.
Common causes
Blind Operations
The team has no application-level metrics, no centralized logging, and no alerting. The
infrastructure may report that servers are running, but nobody can tell whether the application
is actually working correctly. Without instrumentation, the only way to discover a problem is to
wait for someone to experience it and report it.
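The linked page walks through the full fix. As a flavour of what application-level instrumentation means in practice, here is a minimal sketch using the Python prometheus_client library. The metric names, the checkout handler, and the choice of Prometheus are illustrative assumptions, not a prescribed stack.
```python
"""Minimal application-level metrics: a request counter and a latency histogram."""
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("checkout_requests_total",
                   "Checkout requests, labelled by outcome.", ["outcome"])
LATENCY = Histogram("checkout_latency_seconds",
                    "Time spent handling a checkout request.")

def process(order):
    # Stand-in for the real business logic so the sketch runs on its own.
    time.sleep(random.uniform(0.01, 0.05))

def handle_checkout(order):
    start = time.monotonic()
    try:
        process(order)
        REQUESTS.labels(outcome="success").inc()
    except Exception:
        REQUESTS.labels(outcome="error").inc()
        raise
    finally:
        LATENCY.observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for a Prometheus scraper
    while True:
        handle_checkout({"id": 1})
```
With even this much in place, "is it working?" has an answer: compare the error counter and the latency histogram before and after the deploy instead of waiting for complaints.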
Read more: Blind Operations
Manual Deployments
When deployments involve human steps (running scripts by hand, clicking through a console),
there is no automated verification step. The deployment process ends when the human finishes the
steps, not when the system confirms it is healthy. Without an automated pipeline that checks
health metrics after deploying, verification falls to manual spot-checking or waiting for
complaints.
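One way to make verification part of the deployment instead of an afterthought is a post-deploy check the pipeline runs and that fails the job when the service never reports healthy. A minimal sketch, assuming a hypothetical /health endpoint and the Python requests library:
```python
#!/usr/bin/env python3
"""Fail the deployment job if the service does not report healthy."""
import sys
import time

import requests

HEALTH_URL = "https://example.internal/health"  # hypothetical endpoint
ATTEMPTS = 10
DELAY_SECONDS = 6

def service_is_healthy() -> bool:
    try:
        resp = requests.get(HEALTH_URL, timeout=5)
    except requests.RequestException:
        return False
    return resp.status_code == 200

for attempt in range(ATTEMPTS):
    if service_is_healthy():
        print("post-deploy check passed")
        sys.exit(0)
    time.sleep(DELAY_SECONDS)

print("service never reported healthy; treat the deployment as failed")
sys.exit(1)
```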
Read more: Manual Deployments
Missing Deployment Pipeline
When there is no automated path from commit to production, there is nowhere to integrate
automated health checks. A deployment pipeline can include post-deploy verification that
compares metrics before and after. Without a pipeline, verification is entirely manual and
usually skipped under time pressure.
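A before-and-after comparison can be as small as one query against the metrics store on either side of the deploy step. The sketch below assumes Prometheus and an http_requests_total metric; both the address and the metric name are placeholders for whatever your stack exposes, and the threshold is deliberately crude.
```python
"""Before/after error-rate comparison a pipeline could wrap around its deploy step."""
import sys
import time

import requests

PROMETHEUS = "http://prometheus.internal:9090"  # placeholder address
ERROR_RATE = 'sum(rate(http_requests_total{status=~"5.."}[5m]))'

def current_error_rate() -> float:
    resp = requests.get(f"{PROMETHEUS}/api/v1/query",
                        params={"query": ERROR_RATE}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def deploy() -> None:
    """Stand-in for the pipeline's real deploy step."""

if __name__ == "__main__":
    baseline = current_error_rate()     # record the baseline first
    deploy()
    time.sleep(300)                     # let real traffic reach the new version
    after = current_error_rate()
    if after > max(2 * baseline, 0.1):  # crude threshold; tune for your traffic
        print(f"error rate rose from {baseline:.3f}/s to {after:.3f}/s after deploy")
        sys.exit(1)
    print("post-deploy metrics look comparable to the baseline")
```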
Read more: Missing Deployment Pipeline
How to narrow it down
- Does the team have application-level metrics and alerts? If no, the team has no way to
detect problems automatically. Start with
Blind Operations.
- Is the deployment process automated with health checks? If deployments are manual or
automated without post-deploy verification, problems go undetected until users report them.
Start with Manual Deployments or
Missing Deployment Pipeline.
- Does the team check a dashboard after every deployment? If the answer is “sometimes” or
“we click through the app manually,” the verification step is unreliable. Start with
Blind Operations to build
automated verification.
3 - Production Problems Are Discovered Hours or Days Late
Issues in production are not discovered until users report them. There is no automated detection or alerting.
What you are seeing
A deployment goes out on Tuesday. On Thursday, a support ticket comes in: a feature is broken for
a subset of users. The team investigates and discovers the problem was introduced in Tuesday’s
deploy. For two days, users experienced the issue while the team had no idea.
Or a performance degradation appears gradually. Response times creep up over a week. Nobody
notices until a customer complains or a business metric drops. The team checks the dashboards and
sees the degradation started after a specific deploy, but the deploy was days ago and the trail is
cold.
The team deploys carefully and then “watches for a while.” Watching means checking a few URLs
manually or refreshing a dashboard for 15 minutes. If nothing obviously breaks in that window, the
deployment is declared successful. Problems that manifest slowly, affect a subset of users, or
appear under specific conditions go undetected.
Common causes
Blind Operations
When the team has no monitoring, no alerting, and no aggregated logging, production is a black
box. The only signal that something is wrong comes from users, support staff, or business reports.
The team cannot detect problems because they have no instruments to detect them with. Adding
observability (metrics, structured logging, distributed tracing, alerting) gives the team eyes on
production.
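Structured logging is often the cheapest first step: one JSON event per line that a log aggregator can index and query, instead of raw text read over SSH. A minimal sketch using only the Python standard library; the logger name and field values are made up for illustration.
```python
"""Emit one JSON event per line so logs can be aggregated and queried."""
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        event = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Carry structured fields passed via `extra=` through to the output.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

log = logging.getLogger("payments")
log.warning("charge declined",
            extra={"fields": {"order_id": "A-1432", "latency_ms": 840}})
```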
Read more: Blind Operations
Undone Work
When the team’s definition of done does not include post-deployment verification, nobody is
responsible for confirming that the deployment is healthy. The story is “done” when the code is
merged or deployed, not when it is verified in production. Health checks, smoke tests, and canary
analysis are not part of the workflow because the workflow ends before production.
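Making verification part of done can be as concrete as a small smoke-test suite the pipeline runs against the live environment before the story is closed. A sketch using pytest and requests; the base URL and endpoints are hypothetical.
```python
"""Smoke tests a pipeline could run against the live environment after deploy."""
import os

import requests

# Hypothetical base URL, injected by the pipeline.
BASE_URL = os.environ.get("SMOKE_BASE_URL", "https://staging.example.com")

def test_health_endpoint_reports_ok():
    resp = requests.get(f"{BASE_URL}/health", timeout=5)
    assert resp.status_code == 200

def test_homepage_renders():
    resp = requests.get(BASE_URL, timeout=5)
    assert resp.status_code == 200
    assert "<title>" in resp.text

def test_search_returns_results():
    resp = requests.get(f"{BASE_URL}/api/search", params={"q": "smoke"}, timeout=5)
    assert resp.status_code == 200
    assert isinstance(resp.json(), list)
```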
Read more: Undone Work
Manual Deployments
When deployments are manual, there is no automated post-deploy verification step. An automated
pipeline can include health checks, smoke tests, and rollback triggers as part of the deployment
sequence. A manual deployment ends when the human finishes the runbook. Whether the deployment is
actually healthy is a separate question that may or may not get answered.
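A rollback trigger does not need to be sophisticated to beat waiting for complaints. The sketch below watches a hypothetical health endpoint for a few minutes after the deploy and rolls back on repeated failures; the kubectl command assumes a Kubernetes deployment named myapp and stands in for whatever rollback mechanism you actually have.
```python
"""Watch the service after a deploy and roll back on repeated health failures."""
import subprocess
import sys
import time

import requests

HEALTH_URL = "https://example.internal/health"  # hypothetical endpoint
CHECKS = 10
CHECK_INTERVAL_SECONDS = 30
FAILURE_LIMIT = 3

def healthy() -> bool:
    try:
        return requests.get(HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        return False

failures = 0
for _ in range(CHECKS):
    if not healthy():
        failures += 1
    time.sleep(CHECK_INTERVAL_SECONDS)

if failures >= FAILURE_LIMIT:
    # Assumes a Kubernetes deployment named "myapp"; substitute your own command.
    subprocess.run(["kubectl", "rollout", "undo", "deployment/myapp"], check=True)
    sys.exit(1)
print("deployment stayed healthy through the watch window")
```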
Read more: Manual Deployments
How to narrow it down
- Does the team have production monitoring with alerting thresholds? If not, the team cannot
detect problems that users do not report. Start with
Blind Operations.
- Does the team’s definition of done include post-deploy verification? If stories are closed
before production health is confirmed, nobody owns the detection step. Start with
Undone Work.
- Does the deployment process include automated health checks? If deployments end when the
human finishes the script, there is no automated verification. Start with
Manual Deployments.
4 - It Works on My Machine
Code that works in one developer’s environment fails in another, in CI, or in production. Environment differences make results unreproducible.
What you are seeing
A developer runs the application locally and everything works. They push to CI and the build
fails. Or a teammate pulls the same branch and gets a different result. Or a bug report comes in
that nobody can reproduce locally.
The team spends hours debugging only to discover the issue is environmental: a different Node
version, a missing system library, a different database encoding, or a service running on the
developer’s machine that is not available in CI. The code is correct. The environments are
different.
New team members experience this acutely. Setting up a development environment takes days of
following an outdated wiki page, asking teammates for help, and discovering undocumented
dependencies. Every developer’s machine accumulates unique configuration over time, making “works
on my machine” a common refrain and a useless debugging signal.
Common causes
Snowflake Environments
When development environments are set up manually and maintained individually, each developer’s
machine becomes unique. One developer installed Python 3.9, another has 3.11. One has PostgreSQL
14, another has 15. These differences are invisible until someone hits a version-specific behavior.
Reproducible, containerized development environments eliminate the variance by ensuring every
developer works in an identical setup.
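Containerized environments are the fix the linked page describes. Until you have them, even a small drift report makes the invisible differences visible. A sketch that compares local tool versions against a shared pinned set; the tools and versions here are made up, and in practice the pins would live in the repository.
```python
"""Report where a developer's machine has drifted from the team's pinned versions."""
import shutil
import subprocess

# Hypothetical pins the team agrees on.
EXPECTED = {"python3": "3.11", "psql": "15", "node": "20"}

for tool, expected in EXPECTED.items():
    if shutil.which(tool) is None:
        print(f"{tool}: missing (team expects {expected}.x)")
        continue
    result = subprocess.run([tool, "--version"], capture_output=True, text=True)
    reported = (result.stdout or result.stderr).strip()
    status = "ok" if expected in reported else f"drift (team expects {expected}.x)"
    print(f"{tool}: {reported} -> {status}")
```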
Read more: Snowflake Environments
Manual Deployments
When environment setup is a manual process documented in a wiki or README, it is never followed
identically. Each developer interprets the instructions slightly differently, installs a slightly
different version, or skips a step that seems optional. The manual process guarantees divergence
over time. Infrastructure as code and automated setup scripts ensure consistency.
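An automated setup script replaces the wiki page with something that either produces the same environment every time or fails loudly. A minimal sketch that assumes a pinned interpreter version and a requirements.lock file in the repository; both are assumptions to adapt to your own project.
```python
#!/usr/bin/env python3
"""Scripted dev-environment setup that fails fast instead of silently diverging."""
import subprocess
import sys
from pathlib import Path

REQUIRED = (3, 11)               # assumed pinned interpreter version
LOCK_FILE = "requirements.lock"  # assumed pinned dependency list

if sys.version_info[:2] != REQUIRED:
    sys.exit(f"Python {REQUIRED[0]}.{REQUIRED[1]} required, "
             f"found {sys.version_info.major}.{sys.version_info.minor}")

venv_dir = Path(".venv")
if not venv_dir.exists():
    subprocess.run([sys.executable, "-m", "venv", str(venv_dir)], check=True)

pip = venv_dir / "bin" / "pip"   # on Windows this is .venv/Scripts/pip.exe
subprocess.run([str(pip), "install", "--requirement", LOCK_FILE], check=True)
print("environment ready: .venv matches the pinned versions")
```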
Read more: Manual Deployments
Tightly Coupled Monolith
When the application has implicit dependencies on its environment (specific file paths, locally
running services, system-level configuration), it is inherently sensitive to environmental
differences. Well-designed code with explicit, declared dependencies works the same way
everywhere. Code that reaches into its runtime environment for undeclared dependencies works only
where those dependencies happen to exist.
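The difference shows up at the level of a single function: implicit dependencies reach into the machine, explicit ones arrive as declared configuration. The path, file name, and environment variable below are made up for illustration.
```python
import json
import os
from pathlib import Path

# Implicit dependency: reaches into the machine for a path nobody declared.
# Works only where /opt/shared/prices.json happens to exist.
def load_prices_implicit() -> dict:
    with open("/opt/shared/prices.json") as f:
        return json.load(f)

# Explicit dependency: the location is declared configuration, passed in by the
# caller, so any environment that supplies it behaves the same way.
def load_prices(path: Path) -> dict:
    with path.open() as f:
        return json.load(f)

if __name__ == "__main__":
    prices = load_prices(Path(os.environ["PRICES_FILE"]))
    print(f"loaded {len(prices)} price entries")
```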
Read more: Tightly Coupled Monolith
How to narrow it down
- Do all developers use the same OS, runtime versions, and dependency versions? If not,
environment divergence is the most likely cause. Start with
Snowflake Environments.
- Is the development environment setup automated or manual? If it is a wiki page that takes
a day to follow, the manual process creates the divergence. Start with
Manual Deployments.
- Does the application depend on local services, file paths, or system configuration that is
not declared in the codebase? If the application has implicit environmental dependencies,
it will behave differently wherever those dependencies differ. Start with
Tightly Coupled Monolith.