Production Issues Discovered by Customers
What you are seeing
The team deploys a change. Someone asks “is it working?” Nobody knows. There is no dashboard to check. There are no metrics to compare before and after. The team waits. If nobody complains within an hour, they assume the deployment was successful.
When something does go wrong, the team finds out from a customer support ticket, a Slack message from another team, or an executive asking why the site is slow. The investigation starts with SSH-ing into a server and reading raw log files. Hours pass before anyone understands what happened, what caused it, or how many users were affected.
Common causes
Blind Operations
The team has no application-level metrics, no centralized logging, and no alerting. The infrastructure may report that servers are running, but nobody can tell whether the application is actually working correctly. Without instrumentation, the only way to discover a problem is to wait for someone to experience it and report it.
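As a rough illustration of what application-level instrumentation means, the sketch below counts requests and errors in memory and flags when the error rate crosses a threshold. The `RequestMetrics` class, the `should_alert` function, and the 5% threshold are all hypothetical names and values chosen for this example; a real system would send these numbers to a metrics backend rather than keep them in a Python object.

```python
from dataclasses import dataclass, field


@dataclass
class RequestMetrics:
    """In-memory request counters (hypothetical; not a real metrics library)."""

    total: int = 0
    errors: int = 0
    latencies_ms: list = field(default_factory=list)

    def record(self, status_code: int, latency_ms: float) -> None:
        # Count every request; treat 5xx responses as application errors.
        self.total += 1
        if status_code >= 500:
            self.errors += 1
        self.latencies_ms.append(latency_ms)

    def error_rate(self) -> float:
        # Fraction of requests that failed; 0.0 when nothing was recorded.
        return self.errors / self.total if self.total else 0.0


def should_alert(metrics: RequestMetrics, max_error_rate: float = 0.05) -> bool:
    # Alert when the observed error rate exceeds the assumed threshold.
    return metrics.error_rate() > max_error_rate


# Simulated traffic: three successful requests and one server error.
metrics = RequestMetrics()
for status, latency in [(200, 42.0), (200, 38.5), (500, 910.0), (200, 40.1)]:
    metrics.record(status, latency)

print(f"error rate: {metrics.error_rate():.2f}")
print("alert" if should_alert(metrics) else "ok")
```

The point of the sketch is that "servers are running" and "the application is working" are different signals: the process above stays up the whole time, yet one in four requests fails.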
Read more: Blind Operations
Manual Deployments
When deployments involve human steps (running scripts by hand, clicking through a console), there is no automated verification step. The deployment process ends when the human finishes the steps, not when the system confirms it is healthy. Without an automated pipeline that checks health metrics after deploying, verification falls to manual spot-checking or waiting for complaints.
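One way to make a deployment end when the system confirms health, rather than when the human finishes typing, is to poll a health probe after the release step. The sketch below assumes the release step and the probe are supplied as plain callables; `wait_until_healthy`, `deploy`, and their parameters are hypothetical names for illustration, not part of any particular deployment tool.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 5,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it reports healthy or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            if probe():
                return True
        except Exception:
            # A probe that raises (for example, connection refused)
            # counts as an unhealthy result, not a crash of the deploy.
            pass
        if attempt < attempts:
            sleep(delay_seconds)
    return False


def deploy(release: Callable[[], None], probe: Callable[[], bool], **wait_kwargs) -> bool:
    """Run the release step, then report success only if health is confirmed."""
    release()
    return wait_until_healthy(probe, **wait_kwargs)


# Simulated probe: unhealthy twice while the service starts, then healthy.
responses = iter([False, False, True])
events = []
ok = deploy(
    release=lambda: events.append("released"),
    probe=lambda: next(responses),
    delay_seconds=0.0,
)
print("deployment verified" if ok else "deployment failed verification")
```

In practice the probe would call a health endpoint or query a metric, and a `False` result would fail the deployment instead of leaving the team to wait for complaints.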
Read more: Manual Deployments
Missing Deployment Pipeline
When there is no automated path from commit to production, there is nowhere to integrate automated health checks. A deployment pipeline can include post-deploy verification that compares metrics before and after. Without a pipeline, verification is entirely manual and usually skipped under time pressure.
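A post-deploy verification step that compares metrics before and after might look like the following sketch. The `Snapshot` fields, the `verify_deploy` function, and both thresholds are assumptions made for this example; the right metrics and tolerances depend on the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Metrics sampled over a fixed window (hypothetical shape)."""

    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float  # 95th percentile latency in milliseconds


def verify_deploy(
    before: Snapshot,
    after: Snapshot,
    max_error_rate_increase: float = 0.01,
    max_latency_ratio: float = 1.5,
) -> list:
    """Return a list of regressions; an empty list means the check passed."""
    problems = []
    if after.error_rate - before.error_rate > max_error_rate_increase:
        problems.append(
            f"error rate rose from {before.error_rate:.3f} to {after.error_rate:.3f}"
        )
    if (
        before.p95_latency_ms > 0
        and after.p95_latency_ms / before.p95_latency_ms > max_latency_ratio
    ):
        problems.append(
            f"p95 latency rose from {before.p95_latency_ms:.0f} ms "
            f"to {after.p95_latency_ms:.0f} ms"
        )
    return problems


# Example values are made up: errors jump after the deploy, latency barely moves.
before = Snapshot(error_rate=0.002, p95_latency_ms=180.0)
after = Snapshot(error_rate=0.031, p95_latency_ms=195.0)

problems = verify_deploy(before, after)
for problem in problems:
    print("regression:", problem)
```

A pipeline stage could run this check a few minutes after the release and fail the run when the list is not empty, which gives the question "is it working?" an automatic answer.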
Read more: Missing Deployment Pipeline
How to narrow it down
- Does the team have application-level metrics and alerts? If not, the team has no way to detect problems automatically. Start with Blind Operations.
- Is the deployment process automated with health checks? If deployments are manual, or automated but lacking post-deploy verification, problems go undetected until users report them. Start with Manual Deployments or Missing Deployment Pipeline.
- Does the team check a dashboard after every deployment? If the answer is “sometimes” or “we click through the app manually,” the verification step is unreliable. Start with Blind Operations to build automated verification.
Related Content
- Production Problems Are Discovered Hours or Days Late - Both symptoms indicate missing observability
- Staging Passes but Production Fails - Staging does not catch what monitoring would
- Blind Operations - No monitoring, no alerting, no visibility
- Progressive Rollout - Canary deployments that detect problems before full rollout
- Mean Time to Repair - Measure how quickly the team detects and resolves incidents