Phase 4: Deliver on Demand
The capability to deploy any change to production at any time, using the delivery strategy that fits your context.
Key question: “Can we deliver any change to production when the business needs it?”
This is the destination: you can deploy any change that passes the pipeline to production
whenever you choose. Some teams will auto-deploy every commit (continuous deployment). Others
will deploy on demand when the business is ready. Both are valid - the capability is what
matters, not the trigger.
What You’ll Do
- Deploy on demand - Remove the last manual gates so any green build can reach production
- Use progressive rollout - Canary, blue-green, and percentage-based deployments
- Explore agentic CD - AI-assisted continuous delivery patterns
- Learn from experience reports - How other teams made the journey
Continuous Delivery vs. Continuous Deployment
These terms are often confused. The distinction matters for this phase:
- Continuous delivery means every commit that passes the pipeline could be deployed to
production at any time. The capability exists. A human or business process decides when.
- Continuous deployment means every commit that passes the pipeline is deployed to
production automatically. No human decision is involved.
Continuous delivery is the goal of this migration guide. Continuous deployment is one delivery
strategy that works well for certain contexts - SaaS products, internal tools, services behind
feature flags. It is not a higher level of maturity. A team that deploys on demand with a
one-click deploy is just as capable as a team that auto-deploys every commit.
Why This Phase Matters
When your foundations are solid, your pipeline is reliable, and your batch sizes are small,
deploying any change becomes low-risk. The remaining barriers are organizational, not
technical: approval processes, change windows, release coordination. This phase addresses those
barriers so the team has the option to deploy whenever the business needs it.
Signs You’ve Arrived
- Any commit that passes the pipeline can reach production within minutes
- The team deploys frequently (daily or more) with no drama
- Mean time to recovery is measured in minutes
- The team has confidence that any deployment can be safely rolled back
- New team members can deploy on their first day
- The deployment strategy (on-demand or automatic) is a team choice, not a constraint
1 - Deploy on Demand
Remove the last manual gates and deploy every change that passes the pipeline.
Phase 4 - Deliver on Demand | Original content
Deploy on demand means that any change which passes the full automated pipeline can reach production without waiting for a human to press a button, open a ticket, or schedule a window. This page covers the prerequisites, the transition from continuous delivery to continuous deployment, and how to address the organizational concerns that are the real barriers.
Continuous Delivery vs. Continuous Deployment
These two terms are often confused. The distinction matters:
- Continuous Delivery: Every commit that passes the pipeline could be deployed to production. A human decides when to deploy.
- Continuous Deployment: Every commit that passes the pipeline is deployed to production. No human decision is required.
If you have completed Phases 1-3 of this migration, you have continuous delivery. This page is about removing that last manual decision and moving to continuous deployment.
Why Remove the Last Gate?
The manual deployment decision feels safe. It gives someone a chance to “eyeball” the change before it goes to production. In practice, it does the opposite.
The Problems with Manual Gates
| Problem | Why It Happens | Impact |
|---------|----------------|--------|
| Batching | If deploys are manual, teams batch changes to reduce the number of deploy events | Larger batches increase risk and make rollback harder |
| Delay | Changes wait for someone to approve, which may take hours or days | Longer lead time, delayed feedback |
| False confidence | The approver cannot meaningfully review what the automated pipeline already tested | The gate provides the illusion of safety without actual safety |
| Bottleneck | One person or team becomes the deploy gatekeeper | Creates a single point of failure for the entire delivery flow |
| Deploy fear | Infrequent deploys mean each deploy is higher stakes | Teams become more cautious, batches get larger, risk increases |
The Paradox of Manual Safety
The more you rely on manual deployment gates, the less safe your deployments become. This is because manual gates lead to batching, batching increases risk, and increased risk justifies more manual gates. It is a vicious cycle.
Continuous deployment breaks this cycle. Small, frequent, automated deployments are individually low-risk. If one fails, the blast radius is small and recovery is fast.
Prerequisites for Deploy on Demand
Before removing manual gates, verify that these conditions are met. Each one is covered in earlier phases of this migration.
Non-Negotiable Prerequisites
| Prerequisite | What It Means | Where to Build It |
|--------------|---------------|-------------------|
| Comprehensive automated tests | The test suite catches real defects, not just trivial cases | Testing Fundamentals |
| Fast, reliable pipeline | The pipeline completes in under 15 minutes and rarely fails for non-code reasons | Deterministic Pipeline |
| Automated rollback | You can roll back a bad deployment in minutes without manual intervention | Rollback |
| Feature flags | Incomplete features are hidden from users via flags, not deployment timing | Feature Flags |
| Small batch sizes | Each deployment contains 1-3 small changes, not dozens | Small Batches |
| Production-like environments | Test environments match production closely enough that test results are trustworthy | Production-Like Environments |
| Observability | You can detect production issues within minutes through monitoring and alerting | Metrics-Driven Improvement |
Assessment: Are You Ready?
Answer these questions honestly:
- When was the last time your pipeline caught a real bug? If the answer is “I don’t remember,” your test suite may not be trustworthy enough.
- How long does a rollback take? If the answer is more than 15 minutes, automate it first.
- Do deploys ever fail for non-code reasons? (Environment issues, credential problems, network flakiness.) If yes, stabilize your pipeline first.
- Does the team trust the pipeline? If team members regularly say “let me check one more thing before we deploy,” trust is not there yet. Build it through retrospectives and transparent metrics.
The Transition: Three Approaches
Approach 1: Shadow Mode
Run continuous deployment alongside manual deployment. Every change that passes the pipeline is automatically deployed to a shadow production environment (or a canary group). A human still approves the “real” production deployment.
Duration: 2-4 weeks.
What you learn: How often the automated deployment would have been correct. If the answer is “every time” (or close to it), the manual gate is not adding value.
Transition: Once the team sees that the shadow deployments are consistently safe, remove the manual gate.
Approach 2: Opt-In per Team
Allow individual teams to adopt continuous deployment while others continue with manual gates. This works well in organizations with multiple teams at different maturity levels.
Duration: Ongoing. Teams opt in when they are ready.
What you learn: Which teams are ready and which need more foundation work. Early adopters demonstrate the pattern for the rest of the organization.
Transition: As more teams succeed, continuous deployment becomes the default. Remaining teams are supported in reaching readiness.
Approach 3: Direct Switchover
Remove the manual gate for all teams at once. This is appropriate when the organization has high confidence in its pipeline and all teams have completed Phases 1-3.
Duration: Immediate.
What you learn: Quickly reveals any hidden dependencies on the manual gate (e.g., deploy coordination between teams, configuration changes that ride along with deployments).
Transition: Be prepared to temporarily revert if unforeseen issues arise. Have a clear rollback plan for the process change itself.
Addressing Organizational Concerns
The technical prerequisites are usually met before the organizational ones. These are the conversations you will need to have.
“What about change management / ITIL?”
Change management frameworks like ITIL define a “standard change” category: a pre-approved, low-risk, well-understood change that does not require a Change Advisory Board (CAB) review. Continuous deployment changes qualify as standard changes because they are:
- Small (one to a few commits)
- Automated (same pipeline every time)
- Reversible (automated rollback)
- Well-tested (comprehensive automated tests)
Work with your change management team to classify pipeline-passing deployments as standard changes. This preserves the governance framework while removing the bottleneck.
“What about compliance and audit?”
Continuous deployment does not eliminate audit trails - it strengthens them. Every deployment is:
- Traceable: Tied to a specific commit, which is tied to a specific story or ticket
- Reproducible: The same pipeline produces the same result every time
- Recorded: Pipeline logs capture every test that passed, every approval that was automated
- Reversible: Rollback history shows when and why a deployment was reverted
Provide auditors with access to pipeline logs, deployment history, and the automated test suite. This is a more complete audit trail than a manual approval signature.
“What about database migrations?”
Database migrations require special care in continuous deployment because they cannot be rolled back as easily as code changes.
Rules for database migrations in CD:
- Migrations must be backward-compatible. The previous version of the code must work with the new schema.
- Use the expand/contract pattern. First deploy the new column/table (expand). Then deploy the code that uses it. Then remove the old column/table (contract). Each step is a separate deployment (see the sketch after this list).
- Never drop a column in the same deployment that stops using it. There is always a window where both old and new code run simultaneously.
- Test migrations in production-like environments before they reach production.
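To make the expand/contract sequence concrete, here is a minimal sketch. The column rename and the `execute` helper are hypothetical stand-ins for your real migration tool (Flyway, Alembic, Liquibase, and so on); this is an illustration of the pacing, not a prescription for any framework.

```python
# Illustrative expand/contract sequence for replacing users.full_name with
# users.display_name. Each function represents a *separate* deployment, and
# every intermediate state works for both the old and the new application code.

def deployment_1_expand(execute):
    # Add the new column; currently deployed code keeps using full_name and ignores it.
    execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    execute("UPDATE users SET display_name = full_name WHERE display_name IS NULL")

def deployment_2_switch():
    # Ship application code that writes both columns and reads display_name.
    # No schema change in this step - it exists so old and new code overlap safely.
    ...

def deployment_3_contract(execute):
    # Only after no running version still reads or writes full_name.
    execute("ALTER TABLE users DROP COLUMN full_name")
```

The point is the sequencing: each function ships as its own pipeline run, and the contract step waits until no deployed version depends on the old column.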
“What if we deploy a breaking change?”
This is why you have automated rollback and observability. The sequence is:
- Deployment happens automatically
- Monitoring detects an issue (error rate spike, latency increase, health check failure)
- Automated rollback triggers (or on-call engineer triggers manual rollback)
- The team investigates and fixes the issue
- The fix goes through the pipeline and deploys automatically
The key insight: this sequence takes minutes with continuous deployment. With manual deployment on a weekly schedule, the same breaking change would take days to detect and fix.
After the Transition
What Changes for the Team
| Before | After |
|--------|-------|
| “Are we deploying today?” | Deploys happen automatically, all the time |
| “Who’s doing the deploy?” | Nobody - the pipeline does it |
| “Can I get this into the next release?” | Every merge to trunk is the next release |
| “We need to coordinate the deploy with team X” | Teams deploy independently |
| “Let’s wait for the deploy window” | There are no deploy windows |
What Stays the Same
- Code review still happens (before merge to trunk)
- Automated tests still run (in the pipeline)
- Feature flags still control feature visibility (decoupling deploy from release)
- Monitoring still catches issues (but now recovery is faster)
- The team still owns its deployments (but the manual step is gone)
The First Week
The first week of continuous deployment will feel uncomfortable. This is normal. The team will instinctively want to “check” deployments that happen automatically. Resist the urge to add manual checks back. Instead:
- Watch the monitoring dashboards more closely than usual
- Have the team discuss each automatic deployment in standup for the first week
- Celebrate the first deployment that goes out without anyone noticing - that is the goal
Key Pitfalls
1. “We adopted continuous deployment but kept the approval step ‘just in case’”
If the approval step exists, it will be used, and you have not actually adopted continuous deployment. Remove the gate completely. If something goes wrong, use rollback - do not use a pre-deployment gate.
2. “Our deploy cadence didn’t actually increase”
Continuous deployment only increases deploy frequency if the team is integrating to trunk frequently. If the team still merges weekly, they will deploy weekly - automatically, but still weekly. Revisit Trunk-Based Development and Small Batches.
3. “We have continuous deployment for the application but not the database/infrastructure”
Partial continuous deployment creates a split experience: application changes flow freely but infrastructure changes still require manual coordination. Extend the pipeline to cover infrastructure as code, database migrations, and configuration changes.
Measuring Success
| Metric | Target | Why It Matters |
|--------|--------|----------------|
| Deployment frequency | Multiple per day | Confirms the pipeline is deploying every change |
| Lead time | < 1 hour from commit to production | Confirms no manual gates are adding delay |
| Manual interventions per deploy | Zero | Confirms the process is fully automated |
| Change failure rate | Stable or improving | Confirms automation is not introducing new failures |
| MTTR | < 15 minutes | Confirms automated rollback is working |
Next Step
Continuous deployment deploys every change, but not every change needs to go to every user at once. Progressive Rollout strategies let you control who sees a change and how quickly it spreads.
2 - Progressive Rollout
Use canary, blue-green, and percentage-based deployments to reduce deployment risk.
Phase 4 - Deliver on Demand | Original content
Progressive rollout strategies let you deploy to production without deploying to all users simultaneously. By exposing changes to a small group first and expanding gradually, you catch problems before they affect your entire user base. This page covers the three major strategies, when to use each, and how to implement automated rollback.
Why Progressive Rollout?
Even with comprehensive tests, production-like environments, and small batch sizes, some issues only surface under real production traffic. Progressive rollout is the final safety layer: it limits the blast radius of any deployment by exposing the change to a small audience first.
This is not a replacement for testing. It is an addition. Your automated tests should catch the vast majority of issues. Progressive rollout catches the rest - the issues that depend on real user behavior, real data volumes, or real infrastructure conditions that cannot be fully replicated in test environments.
The Three Strategies
Strategy 1: Canary Deployment
A canary deployment routes a small percentage of production traffic to the new version while the majority continues to hit the old version. If the canary shows no problems, traffic is gradually shifted.
                      ┌─────────────────┐
              5%      │  New Version    │  ← Canary
           ┌─────────►│  (v2)           │
           │          └─────────────────┘
Traffic ───┤
           │          ┌─────────────────┐
           └─────────►│  Old Version    │  ← Stable
              95%     │  (v1)           │
                      └─────────────────┘
How it works:
- Deploy the new version alongside the old version
- Route 1-5% of traffic to the new version
- Compare key metrics (error rate, latency, business metrics) between canary and stable
- If metrics are healthy, increase traffic to 25%, 50%, 100%
- If metrics degrade, route all traffic back to the old version
When to use canary:
- Changes that affect request handling (API changes, performance optimizations)
- Changes where you want to compare metrics between old and new versions
- Services with high traffic volume (you need enough canary traffic for statistical significance)
When canary is not ideal:
- Changes that affect batch processing or background jobs (no “traffic” to route)
- Very low traffic services (the canary may not get enough traffic to detect issues)
- Database schema changes (both versions must work with the same schema)
Implementation options:
| Infrastructure | How to Route Traffic |
|----------------|----------------------|
| Kubernetes + service mesh (Istio, Linkerd) | Weighted routing rules in VirtualService |
| Load balancer (ALB, NGINX) | Weighted target groups |
| CDN (CloudFront, Fastly) | Origin routing rules |
| Application-level | Feature flag with percentage rollout |
Strategy 2: Blue-Green Deployment
Blue-green deployment maintains two identical production environments. At any time, one (blue) serves live traffic and the other (green) is idle or staging.
BEFORE:
  Traffic ──────► [Blue - v1]    (ACTIVE)
                  [Green]        (IDLE)

DEPLOY:
  Traffic ──────► [Blue - v1]    (ACTIVE)
                  [Green - v2]   (DEPLOYING / SMOKE TESTING)

SWITCH:
  Traffic ──────► [Green - v2]   (ACTIVE)
                  [Blue - v1]    (STANDBY / ROLLBACK TARGET)
How it works:
- Deploy the new version to the idle environment (green)
- Run smoke tests against green to verify basic functionality
- Switch the router/load balancer to point all traffic at green
- Keep blue running as an instant rollback target
- After a stability period, repurpose blue for the next deployment
When to use blue-green:
- You need instant, complete rollback (switch the router back)
- You want to test the deployment in a full production environment before routing traffic
- Your infrastructure supports running two parallel environments cost-effectively
When blue-green is not ideal:
- Stateful applications where both environments share mutable state
- Database migrations (the new version’s schema must work for both environments during transition)
- Cost-sensitive environments (maintaining two full production environments doubles infrastructure cost)
Rollback speed: Seconds. Switching the router back is the fastest rollback mechanism available.
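As a rough illustration of how small the switch step is, here is a self-contained sketch. The `Router` class and the `deploy`/`smoke_test` stubs are hypothetical placeholders for your load balancer or service-mesh API; they are not a real client library.

```python
# Minimal blue-green release sketch with hypothetical stubs.
class Router:
    def __init__(self, active: str = "blue"):
        self.active = active

    def switch_to(self, env: str) -> None:
        print(f"routing 100% of traffic to {env}")   # stand-in for the real cutover call
        self.active = env


def deploy(env: str, version: str) -> None:
    print(f"deploying {version} to {env}")           # stand-in for the real deploy step


def smoke_test(env: str) -> bool:
    print(f"running smoke tests against {env}")      # stand-in for real checks
    return True


def blue_green_release(router: Router, version: str) -> str:
    idle = "green" if router.active == "blue" else "blue"
    deploy(idle, version)                  # 1. deploy to the idle environment
    if not smoke_test(idle):               # 2. verify before any traffic moves
        raise RuntimeError("smoke tests failed; live environment untouched")
    previous = router.active
    router.switch_to(idle)                 # 3. instant cutover
    return previous                        # 4. old environment kept as the rollback target
```

Rolling back is simply `router.switch_to(previous)` - which is why this strategy has the fastest recovery.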
Strategy 3: Percentage-Based Rollout
Percentage-based rollout gradually increases the number of users who see the new version. Unlike canary (which is traffic-based), percentage rollout is typically user-based - a specific user always sees the same version during the rollout period.
Hour 0: 1% of users → v2, 99% → v1
Hour 2: 5% of users → v2, 95% → v1
Hour 8: 25% of users → v2, 75% → v1
Day 2: 50% of users → v2, 50% → v1
Day 3: 100% of users → v2
How it works:
- Enable the new version for a small percentage of users (using feature flags or infrastructure routing)
- Monitor metrics for the affected group
- Gradually increase the percentage over hours or days
- At any point, reduce the percentage back to 0% if issues are detected
When to use percentage rollout:
- User-facing feature changes where you want consistent user experience (a user always sees v1 or v2, not a random mix)
- Changes that benefit from A/B testing data (compare user behavior between groups)
- Long-running rollouts where you want to collect business metrics before full exposure
When percentage rollout is not ideal:
- Backend infrastructure changes with no user-visible impact
- Changes that affect all users equally (e.g., API response format changes)
Implementation: Percentage rollout is typically implemented through Feature Flags (Level 2 or Level 3), using the user ID as the hash key to ensure consistent assignment.
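A minimal sketch of the consistent-assignment idea, assuming a plain hash rather than any particular feature-flag SDK (the salt value and function names are illustrative):

```python
import hashlib

def rollout_bucket(user_id: str, salt: str = "csv-export-rollout") -> int:
    """Map a user to a stable bucket in 0-99; the salt keeps bucket
    assignments independent across different rollouts."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def sees_new_version(user_id: str, rollout_percentage: int) -> bool:
    # A user in bucket 7 starts seeing v2 once the rollout reaches 8%
    # and keeps seeing it, because the percentage only ever increases.
    return rollout_bucket(user_id) < rollout_percentage
```

Hashing rather than random sampling means a user's bucket never changes, so raising the percentage only adds users to the new version - nobody flips back and forth between versions mid-session.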
Choosing the Right Strategy
| Factor | Canary | Blue-Green | Percentage |
|--------|--------|------------|------------|
| Rollback speed | Seconds (reroute traffic) | Seconds (switch environments) | Seconds (disable flag) |
| Infrastructure cost | Low (runs alongside existing) | High (two full environments) | Low (same infrastructure) |
| Metric comparison | Strong (side-by-side comparison) | Weak (before/after only) | Strong (group comparison) |
| User consistency | No (each request may hit different version) | Yes (all users see same version) | Yes (each user sees consistent version) |
| Complexity | Moderate | Moderate | Low (if you have feature flags) |
| Best for | API changes, performance changes | Full environment validation | User-facing features |
Many teams use more than one strategy. A common pattern:
- Blue-green for infrastructure and platform changes
- Canary for service-level changes
- Percentage rollout for user-facing feature changes
Automated Rollback
Progressive rollout is only effective if rollback is automated. A human noticing a problem at 3 AM is not a reliable rollback mechanism.
Metrics to Monitor
Define automated rollback triggers before deploying. Common triggers:
| Metric | Trigger Condition | Example |
|--------|-------------------|---------|
| Error rate | Canary error rate > 2x stable error rate | Stable: 0.1%, Canary: 0.3% -> rollback |
| Latency (p99) | Canary p99 > 1.5x stable p99 | Stable: 200ms, Canary: 400ms -> rollback |
| Health check | Any health check failure | HTTP 500 on /health -> rollback |
| Business metric | Conversion rate drops > 5% for canary group | 10% conversion -> 4% conversion -> rollback |
| Saturation | CPU or memory exceeds threshold | CPU > 90% for 5 minutes -> rollback |
Automated Rollback Flow
Deploy new version
        │
        ▼
Route 5% of traffic to new version
        │
        ▼
Monitor for 15 minutes
        │
        ├── Metrics healthy ──────► Increase to 25%
        │                                  │
        │                                  ▼
        │                          Monitor for 30 minutes
        │                                  │
        │                                  ├── Metrics healthy ──────► Increase to 100%
        │                                  │
        │                                  └── Metrics degraded ─────► ROLLBACK
        │
        └── Metrics degraded ─────► ROLLBACK
Several tools can run this analysis and trigger the rollback automatically:

| Tool | How It Helps |
|------|--------------|
| Argo Rollouts | Kubernetes-native progressive delivery with automated analysis and rollback |
| Flagger | Progressive delivery operator for Kubernetes with Istio, Linkerd, or App Mesh |
| Spinnaker | Multi-cloud deployment platform with canary analysis |
| Custom scripts | Query your metrics system, compare thresholds, trigger rollback via API |
The specific tool matters less than the principle: define rollback criteria before deploying, monitor automatically, and roll back without human intervention.
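For teams taking the custom-script route, the core loop is small. This is a sketch only: `query_metric`, `shift_traffic`, and `rollback` are hypothetical callables you would wire to your metrics and routing APIs, and the thresholds mirror the example table above.

```python
import time

# Thresholds mirror the example table above; tune them per service.
THRESHOLDS = {
    "error_rate_ratio": 2.0,   # canary error rate vs. stable
    "p99_latency_ratio": 1.5,  # canary p99 latency vs. stable
}

def canary_is_healthy(query_metric) -> bool:
    error_ratio = query_metric("error_rate", "canary") / max(query_metric("error_rate", "stable"), 1e-9)
    latency_ratio = query_metric("p99_latency", "canary") / max(query_metric("p99_latency", "stable"), 1e-9)
    return (error_ratio <= THRESHOLDS["error_rate_ratio"]
            and latency_ratio <= THRESHOLDS["p99_latency_ratio"])

def progressive_rollout(query_metric, shift_traffic, rollback,
                        steps=((5, 15 * 60), (25, 30 * 60), (100, 0))):
    """Advance through (traffic %, monitoring window in seconds) steps,
    rolling back the moment the canary looks unhealthy."""
    for percent, window in steps:
        shift_traffic(percent)
        time.sleep(window)                 # wait out the monitoring window
        if not canary_is_healthy(query_metric):
            rollback()
            return False
    return True
```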
Implementing Progressive Rollout
Step 1: Choose Your First Strategy (Week 1)
Pick the strategy that matches your infrastructure:
- If you already have feature flags: start with percentage-based rollout
- If you have Kubernetes with a service mesh: start with canary
- If you have parallel environments: start with blue-green
Step 2: Define Rollback Criteria (Week 1)
Before your first progressive deployment:
- Identify the 3-5 metrics that define “healthy” for your service
- Define numerical thresholds for each metric
- Define the monitoring window (how long to wait before advancing)
- Document the rollback procedure (even if automated, document it for human understanding)
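Writing the criteria down as data rather than prose makes the automation in Step 4 straightforward. A hypothetical example - the service name, metric names, and numbers are placeholders:

```python
# Illustrative rollback criteria for one service, agreed before the first
# progressive deployment. Everything here is an example, not a standard schema.
ROLLBACK_CRITERIA = {
    "service": "checkout-api",
    "monitoring_window_minutes": 15,
    "metrics": [
        {"name": "error_rate",   "rule": "canary <= 2.0 * stable"},
        {"name": "p99_latency",  "rule": "canary <= 1.5 * stable"},
        {"name": "health_check", "rule": "no failures"},
        {"name": "conversion",   "rule": "canary >= 0.95 * stable"},
    ],
}
```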
Step 3: Run a Manual Progressive Rollout (Week 2-3)
Before automating, run the process manually:
- Deploy to a canary or small percentage
- A team member monitors the dashboard for the defined window
- The team member decides to advance or roll back
- Document what they checked and how they decided
This manual practice builds understanding of what the automation will do.
Step 4: Automate the Rollout (Week 4+)
Replace the manual monitoring with automated checks:
- Implement metric queries that check your rollback criteria
- Implement automated traffic shifting (advance or roll back based on metrics)
- Implement alerting so the team knows when a rollback occurs
- Test the automation by intentionally deploying a known-bad change (in a controlled way)
Key Pitfalls
1. “Our canary doesn’t get enough traffic for meaningful metrics”
If your service handles 100 requests per hour, a 5% canary gets 5 requests per hour - not enough to detect problems statistically. Solutions: use a higher canary percentage (25-50%), use longer monitoring windows, or use blue-green instead (which does not require traffic splitting).
2. “We have progressive rollout but rollback is still manual”
Progressive rollout without automated rollback is half a solution. If the canary shows problems at 2 AM and nobody is watching, the damage occurs before anyone responds. Automated rollback is the essential companion to progressive rollout.
3. “We treat progressive rollout as a replacement for testing”
Progressive rollout is the last line of defense, not the first. If you are regularly catching bugs in canary that your test suite should have caught, your test suite needs improvement. Progressive rollout should catch rare, production-specific issues - not common bugs.
4. “Our rollout takes days because we’re too cautious”
A rollout that takes a week negates the benefits of continuous deployment. If your confidence in the pipeline is low enough to require a week-long rollout, the issue is pipeline quality, not rollout speed. Address the root cause through better testing and more production-like environments.
Measuring Success
| Metric | Target | Why It Matters |
|--------|--------|----------------|
| Automated rollbacks per month | Low and stable | Confirms the pipeline catches most issues before production |
| Time from deploy to full rollout | Hours, not days | Confirms the team has confidence in the process |
| Incidents caught by progressive rollout | Tracked (any number) | Confirms the progressive rollout is providing value |
| Manual interventions during rollout | Zero | Confirms the process is fully automated |
Next Step
With deploy on demand and progressive rollout, your technical deployment infrastructure is complete. Agentic CD explores how AI-assisted patterns can extend these practices further.
3 - Agentic CD
Extend continuous deployment with constraints and practices for AI agent-generated changes.
Phase 4 - Deliver on Demand | Adapted from MinimumCD.org
As AI coding agents become capable of generating production-ready code changes, the continuous deployment pipeline must evolve to handle agent-generated work with the same rigor applied to human-generated work - and in some cases, more rigor. Agentic CD defines the additional constraints and artifacts needed when agents contribute to the delivery pipeline.
What Is Agentic CD?
Agentic CD extends the Minimum CD framework to address a new category of contributor: AI agents that can generate, test, and propose code changes. These agents may operate autonomously (generating changes without human prompting) or collaboratively (assisting a human developer).
The core principle is simple: an agent-generated change must meet or exceed the same quality bar as a human-generated change. The pipeline does not care who wrote the code. It cares whether the code is correct, tested, and safe to deploy.
But agents introduce unique challenges that require additional constraints:
- Agents can generate changes faster than humans can review them
- Agents may lack context about organizational norms, business rules, or unstated constraints
- Agents cannot currently exercise judgment about risk in the same way humans can
- Agents may introduce subtle correctness issues that pass automated tests but violate intent
The Six First-Class Artifacts
Agentic CD defines six artifacts that must be explicitly maintained in a delivery pipeline that includes AI agents. These artifacts exist in human-driven CD as well, but they are often implicit. When agents are involved, they must be explicit.
1. Intent Description
What it is: A human-readable description of the desired change, written by a human.
Why it matters for agentic CD: The intent description is the agent’s “prompt” in the broadest sense. It defines what the change should accomplish, not how. Without a clear intent description, the agent may generate technically correct code that does not match what was needed.
Example:
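For instance, an illustrative intent (not from any real project) might read: “Customers should be able to export their invoice history as CSV from the billing page. The export covers the last 12 months, completes within 30 seconds, and must respect existing per-tenant access rules.”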
Key property: The intent description is authored by a human. It is the human’s specification of what the agent should achieve. The agent does not write or modify the intent description.
2. User-Facing Behavior
What it is: A description of how the system should behave from the user’s perspective, expressed as observable outcomes.
Why it matters for agentic CD: Agents can generate code that satisfies tests but does not produce the expected user experience. User-facing behavior descriptions bridge the gap between technical correctness and user value.
Format: BDD scenarios work well here (see Small Batches):
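For instance, continuing the illustrative CSV-export intent above:

```gherkin
Scenario: Customer exports recent invoice history
  Given a customer with invoices from the last 12 months
  When they request a CSV export from the billing page
  Then they receive a CSV containing only their own invoices
  And the export completes within 30 seconds
```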
3. Feature Description
What it is: A technical description of the feature’s architecture, dependencies, and integration points.
Why it matters for agentic CD: Agents need explicit architectural context that human developers often carry in their heads. The feature description tells the agent where the change fits in the system, what components it touches, and what constraints apply.
Example:
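For instance (illustrative): the export endpoint lives in the billing service, reads through the existing InvoiceRepository, streams the file through the shared export gateway, and must not introduce direct database access from the web tier.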
4. Executable Truth
What it is: Automated tests that define the correct behavior of the system. These tests are the authoritative source of truth for what the code should do.
Why it matters for agentic CD: For human developers, tests verify the code. For agent-generated code, tests also constrain the agent. If the tests are comprehensive, the agent cannot generate incorrect code that passes. If the tests are shallow, the agent can generate code that passes tests but does not satisfy the intent.
Key principle: Executable truth must be written or reviewed by a human before the agent generates the implementation. This inverts the common practice of writing tests after code. In agentic CD, the tests come first because they are the specification.
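A minimal sketch of what this looks like in practice, continuing the illustrative CSV-export example. The module, function, and fields are hypothetical and intentionally do not exist yet - these tests are the specification the agent's implementation must satisfy.

```python
# Illustrative "executable truth", written by a human before the agent
# generates any implementation code.
from datetime import date, timedelta

from billing.exports import export_invoices_csv  # hypothetical; to be implemented by the agent

def test_export_contains_only_the_requesting_tenants_invoices():
    rows = export_invoices_csv(tenant_id="tenant-a", today=date(2024, 6, 1))
    assert all(row.tenant_id == "tenant-a" for row in rows)

def test_export_is_limited_to_the_last_12_months():
    rows = export_invoices_csv(tenant_id="tenant-a", today=date(2024, 6, 1))
    cutoff = date(2024, 6, 1) - timedelta(days=365)
    assert all(row.issued_on >= cutoff for row in rows)
```

The agent's job is to make these tests pass without weakening them.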
5. Implementation
What it is: The actual code that implements the feature. In agentic CD, this may be generated entirely by the agent, co-authored by agent and human, or authored by a human with agent assistance.
Why it matters for agentic CD: The implementation is the artifact most likely to be agent-generated. The key requirement is that it must satisfy the executable truth (tests), conform to the feature description (architecture), and achieve the intent description (purpose).
Review requirements: Agent-generated implementation must be reviewed by a human before merging to trunk. The review focuses on:
- Does the implementation match the intent? (Not just “does it pass tests?”)
- Does it follow the architectural constraints in the feature description?
- Does it introduce unnecessary complexity, dependencies, or security risks?
- Would a human developer on the team understand and maintain this code?
6. System Constraints
What it is: Non-functional requirements, security policies, performance budgets, and organizational rules that apply to all changes.
Why it matters for agentic CD: Human developers internalize system constraints through experience and team norms. Agents need these constraints stated explicitly.
Examples:
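Illustrative constraints (every organization's list will differ):
- All endpoints require authentication; no new public routes without a security review
- p99 latency budget of 300 ms for user-facing requests
- No new runtime dependencies without a license and vulnerability check
- Personally identifiable information must never be written to logs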
The Agentic CD Workflow
When an AI agent contributes to a CD pipeline, the workflow extends the standard CD pipeline:
1. HUMAN writes Intent Description
2. HUMAN writes or reviews User-Facing Behavior (BDD scenarios)
3. HUMAN writes or reviews Feature Description (architecture)
4. HUMAN writes or reviews Executable Truth (tests)
5. AGENT generates Implementation (code)
6. PIPELINE validates Implementation against Executable Truth (automated tests)
7. HUMAN reviews Implementation (code review)
8. PIPELINE deploys (same pipeline as any other change)
Key differences from standard CD:
- Steps 1-4 happen before the agent generates code (test-first is mandatory, not optional)
- Step 7 (human review) is mandatory for agent-generated code
- System constraints are checked automatically in the pipeline (Step 6)
Constraints for Agent-Generated Changes
Beyond the six artifacts, agentic CD imposes additional constraints on agent-generated changes:
Change Size Limits
Agent-generated changes must be small. Large agent-generated changes are harder to review and more likely to contain subtle issues.
Guideline: An agent-generated change should modify no more files and no more lines than a human would in a single commit. If the change is larger, break it into multiple sequential changes.
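One way to enforce this is a pipeline check on the diff itself. A sketch, assuming the change sits on a branch compared against `origin/main` and using illustrative limits:

```python
# Illustrative pipeline check that rejects oversized agent-generated changes.
import subprocess
import sys

MAX_FILES, MAX_CHANGED_LINES = 10, 400   # example limits; tune to what your team actually merges

def diff_stats(base: str = "origin/main") -> tuple[int, int]:
    out = subprocess.run(["git", "diff", "--numstat", base],
                         capture_output=True, text=True, check=True).stdout
    files = changed = 0
    for row in out.splitlines():
        added, deleted, _path = row.split("\t", 2)   # numstat: added<TAB>deleted<TAB>path
        files += 1
        changed += (int(added) if added != "-" else 0) + (int(deleted) if deleted != "-" else 0)
    return files, changed

if __name__ == "__main__":
    files, changed = diff_stats()
    if files > MAX_FILES or changed > MAX_CHANGED_LINES:
        sys.exit(f"Change too large to review as one unit: {files} files, {changed} changed lines")
```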
Mandatory Human Review
Every agent-generated change must be reviewed by a human before merging to trunk. This is a non-negotiable constraint. The purpose is not to check the agent’s “work” in a supervisory sense - it is to verify that the change matches the intent and fits the system.
Comprehensive Test Coverage
Agent-generated code must have higher test coverage than the team’s baseline. If the team’s baseline is 80% coverage, agent-generated code should target 90%+. This compensates for the reduced human oversight of the implementation details.
Provenance Tracking
The pipeline must record which changes were agent-generated, which agent generated them, and what prompt or intent description was used. This supports audit, debugging, and learning.
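What the pipeline records can be as simple as a small structured document stored with the deployment metadata. The field names below are an example schema, not a standard, and the values are placeholders:

```python
# Illustrative provenance record, written by the pipeline for each change.
provenance = {
    "commit": "3f9c2ab",                       # placeholder SHA
    "generated_by": "agent",                   # "human", "agent", or "pair"
    "agent": {"name": "example-coding-agent", "version": "2025-01"},
    "intent_description_ref": "TICKET-123",    # links back to the human-authored intent
    "human_reviewer": "a.developer",
    "executable_truth_written_before_code": True,
}
```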
Getting Started with Agentic CD
Before jumping into agentic workflows, ensure your team has the prerequisite delivery practices
in place. The AI Adoption Roadmap provides a
step-by-step sequence: quality tools, clear requirements, hardened guardrails, and reduced
delivery friction - all before accelerating with AI coding.
Phase 1: Agent as Assistant
The agent helps human developers write code, but the human makes all decisions and commits all changes. The pipeline does not know or care about agent involvement.
This is where most teams are today. It requires no pipeline changes.
Phase 2: Agent as Contributor
The agent generates complete changes based on intent descriptions and executable truth. A human reviews and merges. The pipeline validates.
Requires: Explicit intent descriptions, test-first workflow, human review gate.
Phase 3: Agent as Autonomous Contributor
The agent generates, tests, and proposes changes with minimal human involvement. Human review is still mandatory, but the agent handles the full cycle from intent to implementation.
Requires: All six first-class artifacts, comprehensive system constraints, provenance tracking, and high confidence in the executable truth.
Key Pitfalls
1. “We let the agent generate tests and code together”
If the agent writes both the tests and the code, the tests may be designed to pass the code rather than to verify the intent. Tests must be written or reviewed by a human before the agent generates the implementation. This is the most important constraint in agentic CD.
2. “The agent generates changes faster than we can review them”
This is a feature, not a bug - but only if you have the discipline to not merge unreviewed changes. The agent’s speed should not pressure humans to review faster. WIP limits apply: if the review queue is full, the agent stops generating new changes.
3. “We trust the agent because it passed the tests”
Passing tests is necessary but not sufficient. Tests cannot verify intent, architectural fitness, or maintainability. Human review remains mandatory.
4. “We don’t track which changes are agent-generated”
Without provenance tracking, you cannot learn from agent-generated failures, audit agent behavior, or improve the agent’s constraints over time. Track provenance from the start.
Measuring Success
| Metric | Target | Why It Matters |
|--------|--------|----------------|
| Agent-generated change failure rate | Equal to or lower than human-generated | Confirms agent changes meet the same quality bar |
| Review time for agent-generated changes | Comparable to human-generated changes | Confirms changes are reviewable, not rubber-stamped |
| Test coverage for agent-generated code | Higher than baseline | Confirms the additional coverage constraint is met |
| Agent-generated changes with complete artifacts | 100% | Confirms the six-artifact workflow is being followed |
Next Step
For real-world examples of teams that have made the full journey to continuous deployment, see Experience Reports.
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.
4 - Experience Reports
Real-world stories from teams that have made the journey to continuous deployment.
Phase 4 - Deliver on Demand | Adapted from MinimumCD.org
Theory is necessary but insufficient. This page collects experience reports from organizations that have adopted continuous deployment at scale, including the challenges they faced, the approaches they took, and the results they achieved. These reports demonstrate that CD is not limited to startups or greenfield projects - it works in large, complex, regulated environments.
Why Experience Reports Matter
Every team considering continuous deployment faces the same objection: “That works for [Google / Netflix / small startups], but our situation is different.” Experience reports counter this objection with evidence. They show that organizations of every size, in every industry, with every kind of legacy system, have found a path to continuous deployment.
No experience report will match your situation exactly. That is not the point. The point is to extract patterns: what obstacles did these teams encounter, and how did they overcome them?
Walmart: CD at Retail Scale
Context
Walmart operates one of the world’s largest e-commerce platforms alongside its massive physical retail infrastructure. Changes to the platform affect millions of transactions per day. The organization had a traditional release process with weekly deployment windows and multi-stage manual approval.
The Challenge
- Scale: Thousands of developers across hundreds of teams
- Risk tolerance: Any outage affects revenue in real time
- Legacy: Decades of existing systems with deep interdependencies
- Regulation: PCI compliance requirements for payment processing
What They Did
- Invested in a centralized deployment platform (OneOps, later Concord) that standardized the deployment pipeline across all teams
- Broke the monolithic release into independent service deployments
- Implemented automated canary analysis for every deployment
- Moved from weekly release trains to on-demand deployment per team
Key Lessons
- Platform investment pays off. Building a shared deployment platform let hundreds of teams adopt CD without each team solving the same infrastructure problems.
- Compliance and CD are compatible. Automated pipelines with full audit trails satisfied PCI requirements more reliably than manual approval processes.
- Cultural change is harder than technical change. Teams that had operated on weekly release cycles for years needed coaching and support to trust automated deployment.
Microsoft: From Waterfall to Daily Deploys
Context
Microsoft’s Azure DevOps (formerly Visual Studio Team Services) team made a widely documented transformation from 3-year waterfall releases to deploying multiple times per day. This transformation happened within one of the largest software organizations in the world.
The Challenge
- History: Decades of waterfall development culture
- Product complexity: A platform used by millions of developers
- Organizational size: Thousands of engineers across multiple time zones
- Customer expectations: Enterprise customers expected stability and predictability
What They Did
- Broke the product into independently deployable services
- Implemented a ring-based rollout: Ring 0 (team), Ring 1 (internal Microsoft users), Ring 2 (select external users), Ring 3 (all users)
- Invested heavily in automated testing, achieving thousands of tests running in minutes
- Moved from a fixed release cadence to continuous deployment with feature flags controlling release
- Used telemetry to detect issues in real time and triggered automated rollback when metrics degraded
Key Lessons
- Ring-based deployment is progressive rollout. Microsoft’s ring model is an implementation of the progressive rollout strategies described in this guide.
- Feature flags enabled decoupling. By deploying frequently but releasing features incrementally via flags, the team could deploy without worrying about feature completeness.
- The transformation took years, not months. Moving from 3-year cycles to daily deployment was a multi-year journey with incremental progress at each step.
Google: Engineering Productivity at Scale
Context
Google is often cited as the canonical example of continuous deployment, deploying changes to production thousands of times per day across its vast service portfolio.
The Challenge
- Scale: Billions of users, millions of servers
- Monorepo: Most of Google operates from a single repository with billions of lines of code
- Interdependencies: Changes in shared libraries can affect thousands of services
- Velocity: Thousands of engineers committing changes every day
What They Did
- Built a culture of automated testing where tests are a first-class deliverable, not an afterthought
- Implemented a submit queue that runs automated tests on every change before it merges to the trunk
- Invested in build infrastructure (Blaze/Bazel) that can build and test only the affected portions of the codebase
- Used percentage-based rollout for user-facing changes
- Made rollback a one-click operation available to every team
Key Lessons
- Test infrastructure is critical infrastructure. Google’s ability to deploy frequently depends entirely on its ability to test quickly and reliably.
- Monorepo and CD are compatible. The common assumption that CD requires microservices with separate repos is false. Google deploys from a monorepo.
- Invest in tooling before process. Google built the tooling (build systems, test infrastructure, deployment automation) that made good practices the path of least resistance.
Amazon: Two-Pizza Teams and Ownership
Context
Amazon’s transformation to service-oriented architecture and team ownership is one of the most influential in the industry. The “two-pizza team” model and “you build it, you run it” philosophy directly enabled continuous deployment.
The Challenge
- Organizational size: Hundreds of thousands of employees
- System complexity: Thousands of services powering amazon.com and AWS
- Availability requirements: Even brief outages are front-page news
- Pace of innovation: Competitive pressure demands rapid feature delivery
What They Did
- Decomposed the system into independently deployable services, each owned by a small team
- Gave teams full ownership: build, test, deploy, operate, and support
- Built internal deployment tooling (Apollo) that automates canary analysis, rollback, and one-click deployment
- Established the practice of deploying every commit that passes the pipeline, with automated rollback on metric degradation
Key Lessons
- Ownership drives quality. When the team that writes the code also operates it in production, they write better code and build better monitoring.
- Small teams move faster. Two-pizza teams (6-10 people) can make decisions without bureaucratic overhead.
- Automation eliminates toil. Amazon's internal deployment tooling means deploying requires no specialist knowledge - any team member can deploy (and the pipeline usually deploys automatically).
HP: CD in Hardware-Adjacent Software
Context
HP’s LaserJet firmware team demonstrated that continuous delivery principles apply even to embedded software, a domain often considered incompatible with frequent deployment.
The Challenge
- Embedded software: Firmware that runs on physical printers
- Long development cycles: Firmware releases had traditionally been annual
- Quality requirements: Firmware bugs require physical recalls or complex update procedures
- Team size: Large, distributed teams with varying skill levels
What They Did
- Invested in automated testing infrastructure for firmware
- Reduced build times from days to under an hour
- Moved from annual releases to frequent incremental updates
- Implemented continuous integration with automated test suites running on simulator and hardware
Key Lessons
- CD principles are universal. Even embedded firmware can benefit from small batches, automated testing, and continuous integration.
- Build time is a critical constraint. Reducing build time from days to under an hour unlocked the ability to test frequently, which enabled frequent integration, which enabled frequent delivery.
- Results were dramatic: Development costs reduced by approximately 40%, programs delivered on schedule increased by roughly 140%.
Flickr: “10+ Deploys Per Day”
Context
Flickr’s 2009 presentation “10+ Deploys Per Day: Dev and Ops Cooperation” is credited with helping launch the DevOps movement. At a time when most organizations deployed quarterly, Flickr was deploying more than ten times per day.
The Challenge
- Web-scale service: Serving billions of photos to millions of users
- Ops/Dev divide: Traditional separation between development and operations teams
- Fear of change: Deployments were infrequent because they were risky
What They Did
- Built automated infrastructure provisioning and deployment
- Implemented feature flags to decouple deployment from release
- Created a culture of shared responsibility between development and operations
- Made deployment a routine, low-ceremony event that anyone could trigger
- Used IRC bots (and later chat-based tools) to coordinate and log deployments
Key Lessons
- Culture is the enabler. Flickr’s technical practices were important, but the cultural shift - developers and operations working together, shared responsibility, mutual respect - was what made frequent deployment possible.
- Tooling should reduce friction. Flickr’s deployment tools were designed to make deploying as easy as possible. The easier it is to deploy, the more often people deploy, and the smaller each deployment becomes.
- Transparency builds trust. Logging every deployment in a shared channel let everyone see what was deploying, who deployed it, and whether it caused problems. This transparency built organizational trust in frequent deployment.
Common Patterns Across Reports
Despite the diversity of these organizations, several patterns emerge consistently:
1. Investment in Automation Precedes Cultural Change
Every organization built the tooling first. Automated testing, automated deployment, automated rollback - these created the conditions where frequent deployment was possible. Cultural change followed when people saw that the automation worked.
2. Incremental Adoption, Not Big Bang
No organization switched to continuous deployment overnight. They all moved incrementally: shorter release cycles first, then weekly deploys, then daily, then on-demand. Each step built confidence for the next.
3. Team Ownership Is Essential
Organizations that gave teams ownership of their deployments (build it, run it) moved faster than those that kept deployment as a centralized function. Ownership creates accountability, which drives quality.
4. Feature Flags Are Universal
Every organization in these reports uses feature flags to decouple deployment from release. This is not optional for continuous deployment - it is foundational.
5. The Results Are Consistent
Regardless of industry, size, or starting point, organizations that adopt continuous deployment consistently report:
- Higher deployment frequency (daily or more)
- Lower change failure rate (small changes fail less)
- Faster recovery (automated rollback, small blast radius)
- Higher developer satisfaction (less toil, more impact)
- Better business outcomes (faster time to market, reduced costs)
Applying These Lessons to Your Migration
You do not need to be Google-sized to benefit from these patterns. Extract what applies:
- Start with automation. Build the pipeline, the tests, the rollback mechanism.
- Adopt incrementally. Move from monthly to weekly to daily. Do not try to jump to 10 deploys per day on day one.
- Give teams ownership. Let teams deploy their own services.
- Use feature flags. Decouple deployment from release.
- Measure and improve. Track DORA metrics. Run experiments. Use retrospectives.
These are the practices covered throughout this migration guide. The experience reports confirm that they work - not in theory, but in production, at scale, in the real world.
Further Reading
For detailed experience reports and additional case studies, see:
- MinimumCD.org Experience Reports - Collected reports from organizations practicing minimum CD
- Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim - The research behind DORA metrics, with extensive case study data
- Continuous Delivery by Jez Humble and David Farley - The foundational text, with detailed examples from multiple organizations
- The DevOps Handbook by Gene Kim, Jez Humble, Patrick Debois, and John Willis - Case studies from organizations across industries
This content is adapted from MinimumCD.org,
licensed under CC BY 4.0.