Production Visibility and Team Health

Symptoms related to production observability, incident detection, environment parity, and team sustainability.

These symptoms indicate problems with how your team sees and responds to production issues. When problems are invisible until customers report them, or when the team is burning out from process overhead, the delivery system is working against the people in it. Each page describes what you are seeing and links to the anti-patterns most likely causing it.

How to use this section

Start with the symptom that matches what your team experiences. Each symptom page explains what you are seeing, identifies the most likely root causes (anti-patterns), and provides diagnostic questions to narrow down which cause applies to your situation. Follow the anti-pattern link to find concrete fix steps.

Related anti-pattern categories: Monitoring and Observability Anti-Patterns, Organizational and Cultural Anti-Patterns

Related guides: Progressive Rollout, Working Agreements, Metrics-Driven Improvement

1 - Team Burnout and Unsustainable Pace

The team is exhausted. Every sprint is a crunch sprint. There is no time for learning, improvement, or recovery.

What you are seeing

The team is always behind. Sprint commitments are missed or met only through overtime. Developers work evenings and weekends to hit deadlines, then start the next sprint already tired. There is no buffer for unplanned work, so every production incident or stakeholder escalation blows up the plan.

Nobody has time for learning, experimentation, or process improvement. Suggestions like “let’s improve our test suite” or “let’s automate that deployment” are met with “we don’t have time.” The irony is that the manual work those improvements would eliminate is part of what keeps the team too busy.

Attrition risk is high. The most experienced developers leave first because they have options. Their departure increases the load on whoever remains, accelerating the cycle.

Common causes

Thin-Spread Teams

When a small team owns too many products, every developer is stretched across multiple codebases. Context switching consumes 20 to 40 percent of their capacity. The team looks fully utilized but delivers less than a focused team half its size. The utilization trap (“keep everyone busy”) masks the real problem: the team has more responsibilities than it can sustain.

Read more: Thin-Spread Teams

Deadline-Driven Development

When every sprint is driven by an arbitrary deadline, the team never operates at a sustainable pace. There is no recovery period after a crunch because the next deadline starts immediately. Quality is the first casualty, which creates rework, which consumes future capacity, which makes the next deadline even harder to meet. The cycle accelerates until the team collapses.

Read more: Deadline-Driven Development

Unbounded WIP

When there is no limit on work in progress, the team starts many things and finishes few. Every developer juggles multiple items, each getting fragmented attention. The sensation of being constantly busy but never finishing anything is a direct contributor to burnout. The team is working hard on everything and completing nothing.

Read more: Unbounded WIP

Velocity as Individual Metric

When individual story points are tracked, developers cannot afford to help each other, take time to learn, or invest in quality. Every hour must produce measurable output. The pressure to perform individually eliminates the slack that teams need to stay healthy. Helping a teammate, mentoring a junior developer, or improving a build script all become career risks because they do not produce points.

Read more: Velocity as Individual Metric

How to narrow it down

  1. Is the team responsible for more products than it can sustain? If developers are spread across many products with constant context switching, the workload exceeds what the team structure can handle. Start with Thin-Spread Teams.
  2. Is every sprint driven by an external deadline? If the team has not had a sprint without deadline pressure in months, the pace is unsustainable by design. Start with Deadline-Driven Development.
  3. Does the team have more items in progress than team members? If WIP is unbounded and developers juggle multiple items, the team is thrashing rather than delivering. Start with Unbounded WIP.
  4. Are individuals measured by story points or velocity? If developers feel pressure to maximize personal output at the expense of collaboration and sustainability, the measurement system is contributing to burnout. Start with Velocity as Individual Metric.

2 - Production Issues Discovered by Customers

The team finds out about production problems from support tickets, not alerts.

What you are seeing

The team deploys a change. Someone asks “is it working?” Nobody knows. There is no dashboard to check. There are no metrics to compare before and after. The team waits. If nobody complains within an hour, they assume the deployment was successful.

When something does go wrong, the team finds out from a customer support ticket, a Slack message from another team, or an executive asking why the site is slow. The investigation starts with SSH-ing into a server and reading raw log files. Hours pass before anyone understands what happened, what caused it, or how many users were affected.

Common causes

Blind Operations

The team has no application-level metrics, no centralized logging, and no alerting. The infrastructure may report that servers are running, but nobody can tell whether the application is actually working correctly. Without instrumentation, the only way to discover a problem is to wait for someone to experience it and report it.
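
Even coarse application-level metrics change this: instead of waiting for a complaint, the team can watch request counts and error rates move after a deploy. Below is a minimal sketch of what that instrumentation can look like, assuming the Prometheus Python client; the metric names, the /checkout handler, and the process_order stub are illustrative rather than taken from any particular codebase.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Request count and latency, labelled by endpoint so a dashboard can break
# them down. Metric and label names are illustrative.
REQUESTS = Counter("app_requests_total", "Requests handled", ["endpoint", "status"])
LATENCY = Histogram("app_request_duration_seconds", "Request latency", ["endpoint"])

def process_order(request):
    # Stand-in for real business logic.
    return {"accepted": True}

def handle_checkout(request):
    with LATENCY.labels(endpoint="/checkout").time():
        try:
            result = process_order(request)
            REQUESTS.labels(endpoint="/checkout", status="ok").inc()
            return result
        except Exception:
            REQUESTS.labels(endpoint="/checkout", status="error").inc()
            raise

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for a scraper to collect
    handle_checkout({"item": "demo"})
    input("metrics at http://localhost:8000/metrics - press Enter to stop ")
```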

Read more: Blind Operations

Manual Deployments

When deployments involve human steps (running scripts by hand, clicking through a console), there is no automated verification step. The deployment process ends when the human finishes the steps, not when the system confirms it is healthy. Without an automated pipeline that checks health metrics after deploying, verification falls to manual spot-checking or waiting for complaints.
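
As an illustration of the missing verification step, here is a small post-deploy check a pipeline could run instead of a human, using only the Python standard library. The health endpoint, retry budget, and expected response body are assumptions to adapt to your service.

```python
# Poll a health endpoint after a deploy and fail loudly if it does not come
# back healthy. A non-zero exit code fails the pipeline stage.
import json
import sys
import time
import urllib.request

HEALTH_URL = "https://example.com/health"  # hypothetical endpoint
ATTEMPTS = 10
DELAY_SECONDS = 6

def check_health(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            body = json.loads(response.read())
            return response.status == 200 and body.get("status") == "ok"
    except (OSError, ValueError):
        return False

def main() -> int:
    for attempt in range(1, ATTEMPTS + 1):
        if check_health(HEALTH_URL):
            print(f"deployment healthy after {attempt} attempt(s)")
            return 0
        time.sleep(DELAY_SECONDS)
    print("deployment did not become healthy; investigate or roll back")
    return 1

if __name__ == "__main__":
    sys.exit(main())
```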

Read more: Manual Deployments

Missing Deployment Pipeline

When there is no automated path from commit to production, there is nowhere to integrate automated health checks. A deployment pipeline can include post-deploy verification that compares metrics before and after. Without a pipeline, verification is entirely manual and usually skipped under time pressure.
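
One possible shape for that verification, sketched with illustrative numbers: snapshot an error-rate figure before the deploy, read it again afterwards, and fail the stage if it rose beyond an agreed budget. How the pipeline queries the monitoring backend is deliberately left out of this sketch.

```python
# Before/after comparison as a pipeline stage. The baseline and post-deploy
# readings are passed in; the allowed increase (1 percentage point here) is
# an illustrative budget, not a recommendation.
import sys

def verify(baseline: float, after: float, max_increase: float = 0.01) -> bool:
    """Pass if the error rate did not rise more than the allowed budget."""
    return (after - baseline) <= max_increase

if __name__ == "__main__":
    # Usage: python verify_deploy.py <baseline_error_rate> <post_deploy_error_rate>
    # The pipeline captures these from the monitoring backend before and after
    # the deploy step.
    baseline, after = float(sys.argv[1]), float(sys.argv[2])
    if verify(baseline, after):
        print("post-deploy verification passed")
        sys.exit(0)
    print(f"error rate rose from {baseline:.3f} to {after:.3f}; failing the stage")
    sys.exit(1)
```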

Read more: Missing Deployment Pipeline

How to narrow it down

  1. Does the team have application-level metrics and alerts? If no, the team has no way to detect problems automatically. Start with Blind Operations.
  2. Is the deployment process automated with health checks? If deployments are manual or automated without post-deploy verification, problems go undetected until users report them. Start with Manual Deployments or Missing Deployment Pipeline.
  3. Does the team check a dashboard after every deployment? If the answer is “sometimes” or “we click through the app manually,” the verification step is unreliable. Start with Blind Operations to build automated verification.

3 - Production Problems Are Discovered Hours or Days Late

Issues in production are not discovered until users report them. There is no automated detection or alerting.

What you are seeing

A deployment goes out on Tuesday. On Thursday, a support ticket comes in: a feature is broken for a subset of users. The team investigates and discovers the problem was introduced in Tuesday’s deploy. For two days, users experienced the issue while the team had no idea.

Or a performance degradation appears gradually. Response times creep up over a week. Nobody notices until a customer complains or a business metric drops. The team checks the dashboards and sees the degradation started after a specific deploy, but the deploy was days ago and the trail is cold.

The team deploys carefully and then “watches for a while.” Watching means checking a few URLs manually or refreshing a dashboard for 15 minutes. If nothing obviously breaks in that window, the deployment is declared successful. Problems that manifest slowly, affect a subset of users, or appear under specific conditions go undetected.

Common causes

Blind Operations

When the team has no monitoring, no alerting, and no aggregated logging, production is a black box. The only signal that something is wrong comes from users, support staff, or business reports. The team cannot detect problems because they have no instruments to detect them with. Adding observability (metrics, structured logging, distributed tracing, alerting) gives the team eyes on production.
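
Structured logging is often the cheapest of these to start with. Here is a sketch using only the standard library, with illustrative field names; each event becomes a JSON object that a log aggregator can index and query, instead of free-form text to grep on a server.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Carry through structured context passed via the `extra` argument.
        for key in ("order_id", "user_id", "duration_ms"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One queryable JSON object per event; the order id is illustrative.
logger.info("order placed", extra={"order_id": "A-123", "duration_ms": 84})
```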

Read more: Blind Operations

Undone Work

When the team’s definition of done does not include post-deployment verification, nobody is responsible for confirming that the deployment is healthy. The story is “done” when the code is merged or deployed, not when it is verified in production. Health checks, smoke tests, and canary analysis are not part of the workflow because the workflow ends before production.
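
Extending the definition of done can start as small as a smoke-test suite that runs against the freshly deployed environment before the story is closed. A sketch for pytest; the base URL and the list of critical endpoints are assumptions about what "critical path" means for your product.

```python
# Post-deploy smoke tests: the last pipeline stage before a story counts as done.
import os
import urllib.request

import pytest

BASE_URL = os.environ.get("SMOKE_BASE_URL", "https://staging.example.com")

CRITICAL_PATHS = ["/health", "/login", "/api/orders"]  # hypothetical endpoints

@pytest.mark.parametrize("path", CRITICAL_PATHS)
def test_critical_path_responds(path):
    with urllib.request.urlopen(BASE_URL + path, timeout=5) as response:
        assert response.status == 200
```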

Read more: Undone Work

Manual Deployments

When deployments are manual, there is no automated post-deploy verification step. An automated pipeline can include health checks, smoke tests, and rollback triggers as part of the deployment sequence. A manual deployment ends when the human finishes the runbook. Whether the deployment is actually healthy is a separate question that may or may not get answered.
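
The difference an automated sequence makes is that verification and rollback belong to the same step, not to whoever happens to notice later. A sketch of that shape; the deploy and rollback commands, version tags, and the health-check script are placeholders for whatever tooling the team actually uses.

```python
# Deploy, verify, and roll back automatically if verification fails.
import subprocess
import sys

def run(cmd: list[str]) -> None:
    subprocess.run(cmd, check=True)

def verify() -> bool:
    # Placeholder: run a health-check script like the one sketched earlier.
    result = subprocess.run([sys.executable, "check_health.py"])
    return result.returncode == 0

def main() -> int:
    run(["./deploy.sh", "v2.3.1"])   # hypothetical deploy command and version
    if verify():
        print("deploy verified healthy")
        return 0
    print("verification failed; rolling back")
    run(["./deploy.sh", "v2.3.0"])   # redeploy the previous version
    return 1

if __name__ == "__main__":
    sys.exit(main())
```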

Read more: Manual Deployments

How to narrow it down

  1. Does the team have production monitoring with alerting thresholds? If not, the team cannot detect problems that users do not report. Start with Blind Operations.
  2. Does the team’s definition of done include post-deploy verification? If stories are closed before production health is confirmed, nobody owns the detection step. Start with Undone Work.
  3. Does the deployment process include automated health checks? If deployments end when the human finishes the script, there is no automated verification. Start with Manual Deployments.

4 - It Works on My Machine

Code that works in one developer’s environment fails in another, in CI, or in production. Environment differences make results unreproducible.

What you are seeing

A developer runs the application locally and everything works. They push to CI and the build fails. Or a teammate pulls the same branch and gets a different result. Or a bug report comes in that nobody can reproduce locally.

The team spends hours debugging only to discover the issue is environmental: a different Node version, a missing system library, a different database encoding, or a service running on the developer’s machine that is not available in CI. The code is correct. The environments are different.

New team members experience this acutely. Setting up a development environment takes days of following an outdated wiki page, asking teammates for help, and discovering undocumented dependencies. Every developer’s machine accumulates unique configuration over time, making “works on my machine” a common refrain and a useless debugging signal.

Common causes

Snowflake Environments

When development environments are set up manually and maintained individually, each developer’s machine becomes unique. One developer has Python 3.9, another has 3.11. One has PostgreSQL 14, another has 15. These differences are invisible until someone hits a version-specific behavior. Reproducible, containerized development environments eliminate the variance by ensuring every developer works in an identical setup.
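
Containerized, declaratively defined environments are the durable fix; in the meantime, even a small drift check makes the divergence visible instead of waiting for a version-specific bug. A sketch with illustrative tools and pinned versions:

```python
# Compare the versions installed on this machine against versions the team
# has agreed on. Expected versions and version-reading commands are illustrative.
import re
import subprocess
import sys

EXPECTED = {
    "python3": "3.11",
    "node": "20",
    "psql": "15",
}

def installed_version(tool: str) -> str:
    try:
        out = subprocess.run([tool, "--version"], capture_output=True, text=True)
    except FileNotFoundError:
        return "not installed"
    match = re.search(r"(\d+(\.\d+)*)", out.stdout + out.stderr)
    return match.group(1) if match else "unknown"

def main() -> int:
    drift = []
    for tool, expected in EXPECTED.items():
        actual = installed_version(tool)
        if not actual.startswith(expected):
            drift.append(f"{tool}: expected {expected}.x, found {actual}")
    if drift:
        print("environment drift detected:\n  " + "\n  ".join(drift))
        return 1
    print("environment matches the team baseline")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```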

Read more: Snowflake Environments

Manual Deployments

When environment setup is a manual process documented in a wiki or README, it is never followed identically. Each developer interprets the instructions slightly differently, installs a slightly different version, or skips a step that seems optional. The manual process guarantees divergence over time. Infrastructure as code and automated setup scripts ensure consistency.
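
The equivalent move for setup instructions is turning the wiki page into a script that a new teammate runs. A sketch for a typical Python project; the file names and commands are assumptions, and the point is that setup is executed rather than read and interpreted.

```python
# One command instead of a wiki page: create an isolated environment and
# install pinned dependencies. "requirements.lock" is a hypothetical file name.
import subprocess
import sys
import venv
from pathlib import Path

VENV_DIR = Path(".venv")

def main() -> None:
    if not VENV_DIR.exists():
        venv.create(VENV_DIR, with_pip=True)  # same interpreter setup for everyone
    pip = VENV_DIR / ("Scripts" if sys.platform == "win32" else "bin") / "pip"
    # Pinned dependency versions live in the repo, not in anyone's memory.
    subprocess.run([str(pip), "install", "-r", "requirements.lock"], check=True)
    print("development environment ready")

if __name__ == "__main__":
    main()
```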

Read more: Manual Deployments

Tightly Coupled Monolith

When the application has implicit dependencies on its environment (specific file paths, locally running services, system-level configuration), it is inherently sensitive to environmental differences. Well-designed code with explicit, declared dependencies works the same way everywhere. Code that reaches into its runtime environment for undeclared dependencies works only where those dependencies happen to exist.
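
The contrast in code, with illustrative names: the implicit version reaches for a path that happens to exist on one machine, while the explicit version declares its dependencies and receives them at startup, so it behaves the same on a laptop, in CI, and in production.

```python
from dataclasses import dataclass
from pathlib import Path

# Implicit: the code reaches into its environment for something nobody declared.
# It works only on machines where this exact path exists.
def save_report_implicit(data: str) -> None:
    Path("/var/app/reports/latest.txt").write_text(data)

# Explicit: external dependencies are declared and injected at startup.
@dataclass
class AppConfig:
    report_dir: Path
    database_url: str  # locally running services become declared configuration too

def save_report(config: AppConfig, data: str) -> None:
    (config.report_dir / "latest.txt").write_text(data)

if __name__ == "__main__":
    config = AppConfig(report_dir=Path("./reports"), database_url="postgres://localhost/dev")
    config.report_dir.mkdir(exist_ok=True)
    save_report(config, "ok")
```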

Read more: Tightly Coupled Monolith

How to narrow it down

  1. Do all developers use the same OS, runtime versions, and dependency versions? If not, environment divergence is the most likely cause. Start with Snowflake Environments.
  2. Is the development environment setup automated or manual? If it is a wiki page that takes a day to follow, the manual process creates the divergence. Start with Manual Deployments.
  3. Does the application depend on local services, file paths, or system configuration that is not declared in the codebase? If the application has implicit environmental dependencies, it will behave differently wherever those dependencies differ. Start with Tightly Coupled Monolith.