Phase 4: Deliver on Demand

The capability to deploy any change to production at any time, using the delivery strategy that fits your context.

Key question: “Can we deliver any change to production when the business needs it?”

This is the destination: you can deploy any change that passes the pipeline to production whenever you choose. Some teams will auto-deploy every commit (continuous deployment). Others will deploy on demand when the business is ready. Both are valid - the capability is what matters, not the trigger.

What You’ll Do

  1. Deploy on demand - Remove the last manual gates so any green build can reach production
  2. Use progressive rollout - Canary, blue-green, and percentage-based deployments
  3. Explore agentic CD - AI-assisted continuous delivery patterns
  4. Learn from experience reports - How other teams made the journey

Continuous Delivery vs. Continuous Deployment

These terms are often confused. The distinction matters for this phase:

  • Continuous delivery means every commit that passes the pipeline could be deployed to production at any time. The capability exists. A human or business process decides when.
  • Continuous deployment means every commit that passes the pipeline is deployed to production automatically. No human decision is involved.

Continuous delivery is the goal of this migration guide. Continuous deployment is one delivery strategy that works well for certain contexts - SaaS products, internal tools, services behind feature flags. It is not a higher level of maturity. A team that deploys on demand with a one-click deploy is just as capable as a team that auto-deploys every commit.

Why This Phase Matters

When your foundations are solid, your pipeline is reliable, and your batch sizes are small, deploying any change becomes low-risk. The remaining barriers are organizational, not technical: approval processes, change windows, release coordination. This phase addresses those barriers so the team has the option to deploy whenever the business needs it.

Signs You’ve Arrived

  • Any commit that passes the pipeline can reach production within minutes
  • The team deploys frequently (daily or more) with no drama
  • Mean time to recovery is measured in minutes
  • The team has confidence that any deployment can be safely rolled back
  • New team members can deploy on their first day
  • The deployment strategy (on-demand or automatic) is a team choice, not a constraint

1 - Deploy on Demand

Remove the last manual gates and deploy every change that passes the pipeline.

Phase 4 - Deliver on Demand | Original content

Deploy on demand means that any change that passes the full automated pipeline can reach production without waiting for a human to press a button, open a ticket, or schedule a window. This page covers the prerequisites, the transition from continuous delivery to continuous deployment, and how to address the organizational concerns that are the real barriers.

Continuous Delivery vs. Continuous Deployment

These two terms are often confused. The distinction matters:

  • Continuous Delivery: Every commit that passes the pipeline could be deployed to production. A human decides when to deploy.
  • Continuous Deployment: Every commit that passes the pipeline is deployed to production. No human decision is required.

If you have completed Phases 1-3 of this migration, you have continuous delivery. This page is about removing that last manual decision and moving to continuous deployment.

Why Remove the Last Gate?

The manual deployment decision feels safe. It gives someone a chance to “eyeball” the change before it goes to production. In practice, it does the opposite.

The Problems with Manual Gates

| Problem | Why It Happens | Impact |
| --- | --- | --- |
| Batching | If deploys are manual, teams batch changes to reduce the number of deploy events | Larger batches increase risk and make rollback harder |
| Delay | Changes wait for someone to approve, which may take hours or days | Longer lead time, delayed feedback |
| False confidence | The approver cannot meaningfully review what the automated pipeline already tested | The gate provides the illusion of safety without actual safety |
| Bottleneck | One person or team becomes the deploy gatekeeper | Creates a single point of failure for the entire delivery flow |
| Deploy fear | Infrequent deploys mean each deploy is higher stakes | Teams become more cautious, batches get larger, risk increases |

The Paradox of Manual Safety

The more you rely on manual deployment gates, the less safe your deployments become. This is because manual gates lead to batching, batching increases risk, and increased risk justifies more manual gates. It is a vicious cycle.

Continuous deployment breaks this cycle. Small, frequent, automated deployments are individually low-risk. If one fails, the blast radius is small and recovery is fast.

Prerequisites for Deploy on Demand

Before removing manual gates, verify that these conditions are met. Each one is covered in earlier phases of this migration.

Non-Negotiable Prerequisites

| Prerequisite | What It Means | Where to Build It |
| --- | --- | --- |
| Comprehensive automated tests | The test suite catches real defects, not just trivial cases | Testing Fundamentals |
| Fast, reliable pipeline | The pipeline completes in under 15 minutes and rarely fails for non-code reasons | Deterministic Pipeline |
| Automated rollback | You can roll back a bad deployment in minutes without manual intervention | Rollback |
| Feature flags | Incomplete features are hidden from users via flags, not deployment timing | Feature Flags |
| Small batch sizes | Each deployment contains 1-3 small changes, not dozens | Small Batches |
| Production-like environments | Test environments match production closely enough that test results are trustworthy | Production-Like Environments |
| Observability | You can detect production issues within minutes through monitoring and alerting | Metrics-Driven Improvement |

Assessment: Are You Ready?

Answer these questions honestly:

  1. When was the last time your pipeline caught a real bug? If the answer is “I don’t remember,” your test suite may not be trustworthy enough.
  2. How long does a rollback take? If the answer is more than 15 minutes, automate it first.
  3. Do deploys ever fail for non-code reasons? (Environment issues, credential problems, network flakiness.) If yes, stabilize your pipeline first.
  4. Does the team trust the pipeline? If team members regularly say “let me check one more thing before we deploy,” trust is not there yet. Build it through retrospectives and transparent metrics.

The Transition: Three Approaches

Approach 1: Shadow Mode

Run continuous deployment alongside manual deployment. Every change that passes the pipeline is automatically deployed to a shadow production environment (or a canary group). A human still approves the “real” production deployment.

Duration: 2-4 weeks.

What you learn: How often the automated deployment would have been correct. If the answer is “every time” (or close to it), the manual gate is not adding value.

Transition: Once the team sees that the shadow deployments are consistently safe, remove the manual gate.

Approach 2: Opt-In per Team

Allow individual teams to adopt continuous deployment while others continue with manual gates. This works well in organizations with multiple teams at different maturity levels.

Duration: Ongoing. Teams opt in when they are ready.

What you learn: Which teams are ready and which need more foundation work. Early adopters demonstrate the pattern for the rest of the organization.

Transition: As more teams succeed, continuous deployment becomes the default. Remaining teams are supported in reaching readiness.

Approach 3: Direct Switchover

Remove the manual gate for all teams at once. This is appropriate when the organization has high confidence in its pipeline and all teams have completed Phases 1-3.

Duration: Immediate.

What you learn: Quickly reveals any hidden dependencies on the manual gate (e.g., deploy coordination between teams, configuration changes that ride along with deployments).

Transition: Be prepared to temporarily revert if unforeseen issues arise. Have a clear rollback plan for the process change itself.

Addressing Organizational Concerns

The technical prerequisites are usually met before the organizational ones. These are the conversations you will need to have.

“What about change management / ITIL?”

Change management frameworks like ITIL define a “standard change” category: a pre-approved, low-risk, well-understood change that does not require a Change Advisory Board (CAB) review. Continuous deployment changes qualify as standard changes because they are:

  • Small (one to a few commits)
  • Automated (same pipeline every time)
  • Reversible (automated rollback)
  • Well-tested (comprehensive automated tests)

Work with your change management team to classify pipeline-passing deployments as standard changes. This preserves the governance framework while removing the bottleneck.

“What about compliance and audit?”

Continuous deployment does not eliminate audit trails - it strengthens them. Every deployment is:

  • Traceable: Tied to a specific commit, which is tied to a specific story or ticket
  • Reproducible: The same pipeline produces the same result every time
  • Recorded: Pipeline logs capture every test that passed, every approval that was automated
  • Reversible: Rollback history shows when and why a deployment was reverted

Provide auditors with access to pipeline logs, deployment history, and the automated test suite. This is a more complete audit trail than a manual approval signature.
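
As a concrete illustration, the sketch below shows the kind of structured record a pipeline can emit for every automated deployment. It is plain Python with hypothetical field names; mapped onto whatever your pipeline and ticketing system actually expose, this is the audit evidence described above.

import json
from datetime import datetime, timezone

def build_audit_record(commit_sha, pipeline_run, deploy_result):
    """Assemble an illustrative audit record for one automated deployment.

    Field names are hypothetical; map them to what your pipeline and
    ticketing system actually expose.
    """
    return {
        "commit": commit_sha,                        # traceable to a specific change
        "ticket": pipeline_run.get("ticket_id"),     # linked story or ticket
        "pipeline_run_id": pipeline_run["id"],       # reproducible: same pipeline, same checks
        "tests": pipeline_run["test_summary"],       # recorded evidence of automated checks
        "deployed_at": datetime.now(timezone.utc).isoformat(),
        "deployed_by": "pipeline",                   # no manual approval signature involved
        "rollback": deploy_result.get("rollback"),   # reversible: when and why it was reverted, if ever
    }

# Ship the record to whatever log store your auditors can query.
print(json.dumps(build_audit_record(
    "9f3c2ab",
    {"id": "run-4182", "ticket_id": "SHOP-231", "test_summary": {"passed": 412, "failed": 0}},
    {"rollback": None},
), indent=2))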

“What about database migrations?”

Database migrations require special care in continuous deployment because they cannot be rolled back as easily as code changes.

Rules for database migrations in CD:

  1. Migrations must be backward-compatible. The previous version of the code must work with the new schema.
  2. Use the expand/contract pattern. First deploy the new column/table (expand). Then deploy the code that uses it. Then remove the old column/table (contract). Each step is a separate deployment (see the sketch after this list).
  3. Never drop a column in the same deployment that stops using it. There is always a window where both old and new code run simultaneously.
  4. Test migrations in production-like environments before they reach production.
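
The sketch below makes the expand/contract rule concrete. The run_sql helper, table, and column names are illustrative stand-ins for whatever migration tool you actually use; the important part is that each function ships as its own deployment.

# Expand/contract sketch: renaming orders.legacy_ship_date to orders.shipped_at
# across three separate deployments. `run_sql` is a hypothetical stand-in for
# your real migration runner; table and column names are illustrative.

def run_sql(statement: str) -> None:
    print(f"executing: {statement}")  # placeholder for the real migration tool

def deployment_1_expand() -> None:
    # Add the new column alongside the old one. The currently running code
    # never touches the new column, so it keeps working unchanged.
    run_sql("ALTER TABLE orders ADD COLUMN shipped_at TIMESTAMP NULL")

def deployment_2_switch_code() -> None:
    # New application code writes both columns and reads the new one.
    # Backfill existing rows while old and new versions can still run together.
    run_sql("UPDATE orders SET shipped_at = legacy_ship_date WHERE shipped_at IS NULL")

def deployment_3_contract() -> None:
    # Only after no running version references the old column is it dropped.
    run_sql("ALTER TABLE orders DROP COLUMN legacy_ship_date")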

“What if we deploy a breaking change?”

This is why you have automated rollback and observability. The sequence is:

  1. Deployment happens automatically
  2. Monitoring detects an issue (error rate spike, latency increase, health check failure)
  3. Automated rollback triggers (or on-call engineer triggers manual rollback)
  4. The team investigates and fixes the issue
  5. The fix goes through the pipeline and deploys automatically

The key insight: this sequence takes minutes with continuous deployment. With manual deployment on a weekly schedule, the same breaking change would take days to detect and fix.

After the Transition

What Changes for the Team

| Before | After |
| --- | --- |
| “Are we deploying today?” | Deploys happen automatically, all the time |
| “Who’s doing the deploy?” | Nobody - the pipeline does it |
| “Can I get this into the next release?” | Every merge to trunk is the next release |
| “We need to coordinate the deploy with team X” | Teams deploy independently |
| “Let’s wait for the deploy window” | There are no deploy windows |

What Stays the Same

  • Code review still happens (before merge to trunk)
  • Automated tests still run (in the pipeline)
  • Feature flags still control feature visibility (decoupling deploy from release)
  • Monitoring still catches issues (but now recovery is faster)
  • The team still owns its deployments (but the manual step is gone)

The First Week

The first week of continuous deployment will feel uncomfortable. This is normal. The team will instinctively want to “check” deployments that happen automatically. Resist the urge to add manual checks back. Instead:

  • Watch the monitoring dashboards more closely than usual
  • Have the team discuss each automatic deployment in standup for the first week
  • Celebrate the first deployment that goes out without anyone noticing - that is the goal

Key Pitfalls

1. “We adopted continuous deployment but kept the approval step ‘just in case’”

If the approval step exists, it will be used, and you have not actually adopted continuous deployment. Remove the gate completely. If something goes wrong, use rollback - do not use a pre-deployment gate.

2. “Our deploy cadence didn’t actually increase”

Continuous deployment only increases deploy frequency if the team is integrating to trunk frequently. If the team still merges weekly, they will deploy weekly - automatically, but still weekly. Revisit Trunk-Based Development and Small Batches.

3. “We have continuous deployment for the application but not the database/infrastructure”

Partial continuous deployment creates a split experience: application changes flow freely but infrastructure changes still require manual coordination. Extend the pipeline to cover infrastructure as code, database migrations, and configuration changes.

Measuring Success

| Metric | Target | Why It Matters |
| --- | --- | --- |
| Deployment frequency | Multiple per day | Confirms the pipeline is deploying every change |
| Lead time | < 1 hour from commit to production | Confirms no manual gates are adding delay |
| Manual interventions per deploy | Zero | Confirms the process is fully automated |
| Change failure rate | Stable or improving | Confirms automation is not introducing new failures |
| MTTR | < 15 minutes | Confirms automated rollback is working |

Next Step

Continuous deployment deploys every change, but not every change needs to go to every user at once. Progressive Rollout strategies let you control who sees a change and how quickly it spreads.

2 - Progressive Rollout

Use canary, blue-green, and percentage-based deployments to reduce deployment risk.

Phase 4 - Deliver on Demand | Original content

Progressive rollout strategies let you deploy to production without deploying to all users simultaneously. By exposing changes to a small group first and expanding gradually, you catch problems before they affect your entire user base. This page covers the three major strategies, when to use each, and how to implement automated rollback.

Why Progressive Rollout?

Even with comprehensive tests, production-like environments, and small batch sizes, some issues only surface under real production traffic. Progressive rollout is the final safety layer: it limits the blast radius of any deployment by exposing the change to a small audience first.

This is not a replacement for testing. It is an addition. Your automated tests should catch the vast majority of issues. Progressive rollout catches the rest - the issues that depend on real user behavior, real data volumes, or real infrastructure conditions that cannot be fully replicated in test environments.

The Three Strategies

Strategy 1: Canary Deployment

A canary deployment routes a small percentage of production traffic to the new version while the majority continues to hit the old version. If the canary shows no problems, traffic is gradually shifted.

                        ┌─────────────────┐
                   5%   │  New Version     │  ← Canary
                ┌──────►│  (v2)            │
                │       └─────────────────┘
  Traffic ──────┤
                │       ┌─────────────────┐
                └──────►│  Old Version     │  ← Stable
                  95%   │  (v1)            │
                        └─────────────────┘

How it works:

  1. Deploy the new version alongside the old version
  2. Route 1-5% of traffic to the new version
  3. Compare key metrics (error rate, latency, business metrics) between canary and stable
  4. If metrics are healthy, increase traffic to 25%, 50%, 100%
  5. If metrics degrade, route all traffic back to the old version

When to use canary:

  • Changes that affect request handling (API changes, performance optimizations)
  • Changes where you want to compare metrics between old and new versions
  • Services with high traffic volume (you need enough canary traffic for statistical significance)

When canary is not ideal:

  • Changes that affect batch processing or background jobs (no “traffic” to route)
  • Very low traffic services (the canary may not get enough traffic to detect issues)
  • Database schema changes (both versions must work with the same schema)

Implementation options:

| Infrastructure | How to Route Traffic |
| --- | --- |
| Kubernetes + service mesh (Istio, Linkerd) | Weighted routing rules in VirtualService |
| Load balancer (ALB, NGINX) | Weighted target groups |
| CDN (CloudFront, Fastly) | Origin routing rules |
| Application-level | Feature flag with percentage rollout |
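
For the application-level row in the table, a per-request split can be as simple as the following sketch. The canary_weight lookup and the handler names are hypothetical; in practice the weight would come from your feature-flag SDK or be pushed by the deployment controller.

import random

def canary_weight(route: str) -> float:
    # Hypothetical lookup: in a real system this value comes from a feature
    # flag or deployment configuration, not a hardcoded dict.
    return {"/api/search": 0.05}.get(route, 0.0)

def handle_search_stable(request):
    return {"version": "v1", "results": []}   # stable handler (illustrative)

def handle_search_canary(request):
    return {"version": "v2", "results": []}   # canary handler (illustrative)

def route_request(request, route="/api/search"):
    # Per-request assignment: each request independently lands on v1 or v2.
    # Tag your metrics with the version label so the canary and stable
    # series can be compared side by side.
    if random.random() < canary_weight(route):
        return handle_search_canary(request)
    return handle_search_stable(request)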

Strategy 2: Blue-Green Deployment

Blue-green deployment maintains two identical production environments. At any time, one (blue) serves live traffic while the other (green) sits idle or acts as a staging environment for the next release.

  BEFORE:
    Traffic ──────► [Blue - v1] (ACTIVE)
                    [Green]     (IDLE)

  DEPLOY:
    Traffic ──────► [Blue - v1] (ACTIVE)
                    [Green - v2] (DEPLOYING / SMOKE TESTING)

  SWITCH:
    Traffic ──────► [Green - v2] (ACTIVE)
                    [Blue - v1]  (STANDBY / ROLLBACK TARGET)

How it works:

  1. Deploy the new version to the idle environment (green)
  2. Run smoke tests against green to verify basic functionality
  3. Switch the router/load balancer to point all traffic at green
  4. Keep blue running as an instant rollback target
  5. After a stability period, repurpose blue for the next deployment
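
Steps 3-5 amount to flipping a single pointer, which is what makes blue-green rollback so fast. The sketch below assumes a hypothetical router_api client; the real mechanism is usually a load-balancer target-group swap, a service selector change, or a DNS alias update.

def smoke_test(environment: str) -> bool:
    # Hypothetical check: hit a handful of key endpoints on the idle
    # environment and verify they respond correctly.
    return True

def cut_over(router_api, idle_env: str, active_env: str) -> None:
    """Blue-green switch, assuming a hypothetical router_api client."""
    if not smoke_test(idle_env):
        raise RuntimeError(f"smoke tests failed on {idle_env}; traffic never moved")
    # The cutover is a single pointer update...
    router_api.point_production_at(idle_env)
    # ...which is why rollback is equally fast: point production back at
    # active_env, which keeps running untouched as the standby environment.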

When to use blue-green:

  • You need instant, complete rollback (switch the router back)
  • You want to test the deployment in a full production environment before routing traffic
  • Your infrastructure supports running two parallel environments cost-effectively

When blue-green is not ideal:

  • Stateful applications where both environments share mutable state
  • Database migrations (the new version’s schema must work for both environments during transition)
  • Cost-sensitive environments (maintaining two full production environments doubles infrastructure cost)

Rollback speed: Seconds. Switching the router back is the fastest rollback mechanism available.

Strategy 3: Percentage-Based Rollout

Percentage-based rollout gradually increases the number of users who see the new version. Unlike canary (which is traffic-based), percentage rollout is typically user-based - a specific user always sees the same version during the rollout period.

  Hour 0:   1% of users  → v2,  99% → v1
  Hour 2:   5% of users  → v2,  95% → v1
  Hour 8:  25% of users  → v2,  75% → v1
  Day 2:   50% of users  → v2,  50% → v1
  Day 3:  100% of users  → v2

How it works:

  1. Enable the new version for a small percentage of users (using feature flags or infrastructure routing)
  2. Monitor metrics for the affected group
  3. Gradually increase the percentage over hours or days
  4. At any point, reduce the percentage back to 0% if issues are detected

When to use percentage rollout:

  • User-facing feature changes where you want consistent user experience (a user always sees v1 or v2, not a random mix)
  • Changes that benefit from A/B testing data (compare user behavior between groups)
  • Long-running rollouts where you want to collect business metrics before full exposure

When percentage rollout is not ideal:

  • Backend infrastructure changes with no user-visible impact
  • Changes that affect all users equally (e.g., API response format changes)

Implementation: Percentage rollout is typically implemented through Feature Flags (Level 2 or Level 3), using the user ID as the hash key to ensure consistent assignment.
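
A minimal sketch of that consistent assignment is shown below; the flag name is illustrative. Because the bucket is derived from a stable hash of the user ID, a given user stays in the same group as the percentage grows and never bounces between versions.

import hashlib

def rollout_bucket(user_id: str, flag_name: str = "search-v2-rollout") -> int:
    """Map a user deterministically to a bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def sees_new_version(user_id: str, rollout_percentage: int) -> bool:
    # Raising the percentage from 5 to 25 only adds users; nobody who already
    # sees v2 flips back to v1, because the bucket is stable per user.
    return rollout_bucket(user_id) < rollout_percentage

# The same user gets a consistent answer on every request.
assert sees_new_version("user-42", 100) is True
assert sees_new_version("user-42", 0) is False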

Choosing the Right Strategy

| Factor | Canary | Blue-Green | Percentage |
| --- | --- | --- | --- |
| Rollback speed | Seconds (reroute traffic) | Seconds (switch environments) | Seconds (disable flag) |
| Infrastructure cost | Low (runs alongside existing) | High (two full environments) | Low (same infrastructure) |
| Metric comparison | Strong (side-by-side comparison) | Weak (before/after only) | Strong (group comparison) |
| User consistency | No (each request may hit a different version) | Yes (all users see the same version) | Yes (each user sees a consistent version) |
| Complexity | Moderate | Moderate | Low (if you have feature flags) |
| Best for | API changes, performance changes | Full environment validation | User-facing features |

Many teams use more than one strategy. A common pattern:

  • Blue-green for infrastructure and platform changes
  • Canary for service-level changes
  • Percentage rollout for user-facing feature changes

Automated Rollback

Progressive rollout is only effective if rollback is automated. A human noticing a problem at 3 AM is not a reliable rollback mechanism.

Metrics to Monitor

Define automated rollback triggers before deploying. Common triggers:

| Metric | Trigger Condition | Example |
| --- | --- | --- |
| Error rate | Canary error rate > 2x stable error rate | Stable: 0.1%, Canary: 0.3% → rollback |
| Latency (p99) | Canary p99 > 1.5x stable p99 | Stable: 200ms, Canary: 400ms → rollback |
| Health check | Any health check failure | HTTP 500 on /health → rollback |
| Business metric | Conversion rate drops > 5% for canary group | 10% conversion → 4% conversion → rollback |
| Saturation | CPU or memory exceeds threshold | CPU > 90% for 5 minutes → rollback |

Automated Rollback Flow

Deploy new version
       │
       ▼
Route 5% of traffic to new version
       │
       ▼
Monitor for 15 minutes
       │
       ├── Metrics healthy ──────► Increase to 25%
       │                                │
       │                                ▼
       │                          Monitor for 30 minutes
       │                                │
       │                                ├── Metrics healthy ──────► Increase to 100%
       │                                │
       │                                └── Metrics degraded ─────► ROLLBACK
       │
       └── Metrics degraded ─────► ROLLBACK

Implementation Tools

| Tool | How It Helps |
| --- | --- |
| Argo Rollouts | Kubernetes-native progressive delivery with automated analysis and rollback |
| Flagger | Progressive delivery operator for Kubernetes with Istio, Linkerd, or App Mesh |
| Spinnaker | Multi-cloud deployment platform with canary analysis |
| Custom scripts | Query your metrics system, compare thresholds, trigger rollback via API |

The specific tool matters less than the principle: define rollback criteria before deploying, monitor automatically, and roll back without human intervention.
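
As an illustration of the custom-script option, the sketch below mirrors the flow diagram above. The metrics and deployer clients are hypothetical stand-ins for your monitoring system's query API and your deployment tooling, and the thresholds echo the trigger table.

import time

# Thresholds echo the trigger table above; tune them per service.
ERROR_RATE_RATIO_LIMIT = 2.0    # canary error rate must stay below 2x stable
P99_LATENCY_RATIO_LIMIT = 1.5   # canary p99 must stay below 1.5x stable

def canary_is_healthy(metrics) -> bool:
    # `metrics` is a hypothetical client for your monitoring system.
    stable_err = max(metrics.error_rate(version="stable"), 0.0001)  # avoid a zero baseline
    stable_p99 = metrics.latency_p99(version="stable")
    return (metrics.error_rate(version="canary") <= ERROR_RATE_RATIO_LIMIT * stable_err
            and metrics.latency_p99(version="canary") <= P99_LATENCY_RATIO_LIMIT * stable_p99)

def progressive_rollout(metrics, deployer,
                        stages=((5, 15 * 60), (25, 30 * 60), (100, 0))):
    """Advance through each (traffic %, monitoring window in seconds) stage,
    rolling back without human intervention if the canary degrades."""
    for percentage, window_seconds in stages:
        deployer.set_canary_traffic(percentage)      # hypothetical deployment client
        deadline = time.time() + window_seconds
        while time.time() < deadline:
            if not canary_is_healthy(metrics):
                deployer.rollback()                  # route all traffic back to stable
                deployer.alert("canary rolled back automatically")
                return False
            time.sleep(60)                           # re-check once a minute
    return True                                      # fully rolled out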

Implementing Progressive Rollout

Step 1: Choose Your First Strategy (Week 1)

Pick the strategy that matches your infrastructure:

  • If you already have feature flags: start with percentage-based rollout
  • If you have Kubernetes with a service mesh: start with canary
  • If you have parallel environments: start with blue-green

Step 2: Define Rollback Criteria (Week 1)

Before your first progressive deployment:

  1. Identify the 3-5 metrics that define “healthy” for your service
  2. Define numerical thresholds for each metric
  3. Define the monitoring window (how long to wait before advancing)
  4. Document the rollback procedure (even if automated, document it for human understanding)

Step 3: Run a Manual Progressive Rollout (Week 2-3)

Before automating, run the process manually:

  1. Deploy to a canary or small percentage
  2. A team member monitors the dashboard for the defined window
  3. The team member decides to advance or roll back
  4. Document what they checked and how they decided

This manual practice builds understanding of what the automation will do.

Step 4: Automate the Rollout (Week 4+)

Replace the manual monitoring with automated checks:

  1. Implement metric queries that check your rollback criteria
  2. Implement automated traffic shifting (advance or roll back based on metrics)
  3. Implement alerting so the team knows when a rollback occurs
  4. Test the automation by intentionally deploying a known-bad change (in a controlled way)

Key Pitfalls

1. “Our canary doesn’t get enough traffic for meaningful metrics”

If your service handles 100 requests per hour, a 5% canary gets 5 requests per hour - not enough to detect problems statistically. Solutions: use a higher canary percentage (25-50%), use longer monitoring windows, or use blue-green instead (which does not require traffic splitting).

2. “We have progressive rollout but rollback is still manual”

Progressive rollout without automated rollback is half a solution. If the canary shows problems at 2 AM and nobody is watching, the damage occurs before anyone responds. Automated rollback is the essential companion to progressive rollout.

3. “We treat progressive rollout as a replacement for testing”

Progressive rollout is the last line of defense, not the first. If you are regularly catching bugs in canary that your test suite should have caught, your test suite needs improvement. Progressive rollout should catch rare, production-specific issues - not common bugs.

4. “Our rollout takes days because we’re too cautious”

A rollout that takes a week negates the benefits of continuous deployment. If your confidence in the pipeline is low enough to require a week-long rollout, the issue is pipeline quality, not rollout speed. Address the root cause through better testing and more production-like environments.

Measuring Success

| Metric | Target | Why It Matters |
| --- | --- | --- |
| Automated rollbacks per month | Low and stable | Confirms the pipeline catches most issues before production |
| Time from deploy to full rollout | Hours, not days | Confirms the team has confidence in the process |
| Incidents caught by progressive rollout | Tracked (any number) | Confirms the progressive rollout is providing value |
| Manual interventions during rollout | Zero | Confirms the process is fully automated |

Next Step

With deploy on demand and progressive rollout, your technical deployment infrastructure is complete. Agentic CD explores how AI-assisted patterns can extend these practices further.

3 - Agentic CD

Extend continuous deployment with constraints and practices for AI agent-generated changes.

Phase 4 - Deliver on Demand | Adapted from MinimumCD.org

As AI coding agents become capable of generating production-ready code changes, the continuous deployment pipeline must evolve to handle agent-generated work with the same rigor applied to human-generated work - and in some cases, more rigor. Agentic CD defines the additional constraints and artifacts needed when agents contribute to the delivery pipeline.

What Is Agentic CD?

Agentic CD extends the Minimum CD framework to address a new category of contributor: AI agents that can generate, test, and propose code changes. These agents may operate autonomously (generating changes without human prompting) or collaboratively (assisting a human developer).

The core principle is simple: an agent-generated change must meet or exceed the same quality bar as a human-generated change. The pipeline does not care who wrote the code. It cares whether the code is correct, tested, and safe to deploy.

But agents introduce unique challenges that require additional constraints:

  • Agents can generate changes faster than humans can review them
  • Agents may lack context about organizational norms, business rules, or unstated constraints
  • Agents cannot currently exercise judgment about risk in the same way humans can
  • Agents may introduce subtle correctness issues that pass automated tests but violate intent

The Six First-Class Artifacts

Agentic CD defines six artifacts that must be explicitly maintained in a delivery pipeline that includes AI agents. These artifacts exist in human-driven CD as well, but they are often implicit. When agents are involved, they must be explicit.

1. Intent Description

What it is: A human-readable description of the desired change, written by a human.

Why it matters for agentic CD: The intent description is the agent’s “prompt” in the broadest sense. It defines what the change should accomplish, not how. Without a clear intent description, the agent may generate technically correct code that does not match what was needed.

Example:

## Intent: Add rate limiting to the /api/search endpoint

We are receiving complaints about slow response times during peak hours.
Analysis shows that a small number of clients are making thousands of
requests per minute. We need to limit each authenticated client to 100
requests per minute on the /api/search endpoint. Requests that exceed
the limit should receive a 429 response with a Retry-After header.

Key property: The intent description is authored by a human. It is the human’s specification of what the agent should achieve. The agent does not write or modify the intent description.

2. User-Facing Behavior

What it is: A description of how the system should behave from the user’s perspective, expressed as observable outcomes.

Why it matters for agentic CD: Agents can generate code that satisfies tests but does not produce the expected user experience. User-facing behavior descriptions bridge the gap between technical correctness and user value.

Format: BDD scenarios work well here (see Small Batches):

Scenario: Client exceeds rate limit
  Given an authenticated client
  And the client has made 100 requests in the current minute
  When the client makes another request to /api/search
  Then the response status should be 429
  And the response should include a Retry-After header
  And the Retry-After value should indicate when the limit resets

Scenario: Client within rate limit
  Given an authenticated client
  And the client has made 50 requests in the current minute
  When the client makes a request to /api/search
  Then the request should be processed normally
  And the response should include rate limit headers showing remaining quota

3. Feature Description

What it is: A technical description of the feature’s architecture, dependencies, and integration points.

Why it matters for agentic CD: Agents need explicit architectural context that human developers often carry in their heads. The feature description tells the agent where the change fits in the system, what components it touches, and what constraints apply.

Example:

## Feature: Rate Limiting for Search API

### Architecture
- Rate limit middleware sits between authentication and the search handler
- Rate limit state is stored in Redis (shared across application instances)
- Rate limit configuration is read from the application config, not hardcoded

### Dependencies
- Redis client library (already in use for session storage)
- No new external dependencies should be introduced

### Constraints
- Must not add more than 5ms of latency to the request path
- Must work correctly with our horizontal scaling (3-12 instances)
- Must be configurable per-endpoint (other endpoints may have different limits later)

4. Executable Truth

What it is: Automated tests that define the correct behavior of the system. These tests are the authoritative source of truth for what the code should do.

Why it matters for agentic CD: For human developers, tests verify the code. For agent-generated code, tests also constrain the agent. If the tests are comprehensive, the agent cannot generate incorrect code that passes. If the tests are shallow, the agent can generate code that passes tests but does not satisfy the intent.

Key principle: Executable truth must be written or reviewed by a human before the agent generates the implementation. This inverts the common practice of writing tests after code. In agentic CD, the tests come first because they are the specification.

# Example test suite. It assumes pytest-style fixtures: `client` (an HTTP test
# client), `redis` (a clean rate-limit store), `time_machine` (clock control),
# and `benchmark` (request timing). The auth header values are placeholders.
auth_headers = {"Authorization": "Bearer client-a-token"}
auth_headers_client_a = auth_headers
auth_headers_client_b = {"Authorization": "Bearer client-b-token"}

class TestRateLimiting:
    def test_allows_requests_within_limit(self, client, redis):
        for _ in range(100):
            response = client.get("/api/search", headers=auth_headers)
            assert response.status_code == 200

    def test_blocks_requests_exceeding_limit(self, client, redis):
        for _ in range(100):
            client.get("/api/search", headers=auth_headers)
        response = client.get("/api/search", headers=auth_headers)
        assert response.status_code == 429
        assert "Retry-After" in response.headers

    def test_limit_resets_after_window(self, client, redis, time_machine):
        for _ in range(100):
            client.get("/api/search", headers=auth_headers)
        time_machine.advance(seconds=61)
        response = client.get("/api/search", headers=auth_headers)
        assert response.status_code == 200

    def test_limits_are_per_client(self, client, redis):
        for _ in range(100):
            client.get("/api/search", headers=auth_headers_client_a)
        response = client.get("/api/search", headers=auth_headers_client_b)
        assert response.status_code == 200

    def test_latency_overhead_within_budget(self, client, redis, benchmark):
        result = benchmark(lambda: client.get("/api/search", headers=auth_headers))
        assert result.mean < 0.005  # 5ms budget

5. Implementation

What it is: The actual code that implements the feature. In agentic CD, this may be generated entirely by the agent, co-authored by agent and human, or authored by a human with agent assistance.

Why it matters for agentic CD: The implementation is the artifact most likely to be agent-generated. The key requirement is that it must satisfy the executable truth (tests), conform to the feature description (architecture), and achieve the intent description (purpose).

Review requirements: Agent-generated implementation must be reviewed by a human before merging to trunk. The review focuses on:

  • Does the implementation match the intent? (Not just “does it pass tests?”)
  • Does it follow the architectural constraints in the feature description?
  • Does it introduce unnecessary complexity, dependencies, or security risks?
  • Would a human developer on the team understand and maintain this code?

6. System Constraints

What it is: Non-functional requirements, security policies, performance budgets, and organizational rules that apply to all changes.

Why it matters for agentic CD: Human developers internalize system constraints through experience and team norms. Agents need these constraints stated explicitly.

Examples:

system_constraints:
  security:
    - No secrets in source code
    - All user input must be sanitized
    - Authentication required for all API endpoints
  performance:
    - API p99 latency < 500ms
    - No N+1 query patterns
    - Database queries must use indexes
  architecture:
    - No circular dependencies between modules
    - External service calls must use circuit breakers
    - All new dependencies require team approval
  operations:
    - All new features must have monitoring dashboards
    - Log structured data, not strings
    - Feature flags required for user-visible changes

The Agentic CD Workflow

When an AI agent contributes to a CD pipeline, the workflow extends the standard CD pipeline:

1. HUMAN writes Intent Description
2. HUMAN writes or reviews User-Facing Behavior (BDD scenarios)
3. HUMAN writes or reviews Feature Description (architecture)
4. HUMAN writes or reviews Executable Truth (tests)
5. AGENT generates Implementation (code)
6. PIPELINE validates Implementation against Executable Truth (automated tests)
7. HUMAN reviews Implementation (code review)
8. PIPELINE deploys (same pipeline as any other change)

Key differences from standard CD:

  • Steps 1-4 happen before the agent generates code (test-first is mandatory, not optional)
  • Step 7 (human review) is mandatory for agent-generated code
  • System constraints are checked automatically in the pipeline (Step 6)

Constraints for Agent-Generated Changes

Beyond the six artifacts, agentic CD imposes additional constraints on agent-generated changes:

Change Size Limits

Agent-generated changes must be small. Large agent-generated changes are harder to review and more likely to contain subtle issues.

Guideline: An agent-generated change should modify no more files and no more lines than a human would in a single commit. If the change is larger, break it into multiple sequential changes.
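
One way to enforce this guideline mechanically is a small pipeline check on the proposed diff, as sketched below. The limits are illustrative; the only real dependency is `git diff --numstat` against your trunk branch.

import subprocess
import sys

MAX_FILES = 10           # illustrative limits: tune to what "one human-sized
MAX_LINES_CHANGED = 400  # commit" means for your team

def change_size(base: str = "origin/main") -> tuple:
    """Count files and lines changed relative to trunk via `git diff --numstat`."""
    out = subprocess.run(
        ["git", "diff", "--numstat", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    files = lines = 0
    for row in out.splitlines():
        added, deleted, _path = row.split("\t", 2)
        files += 1
        if added != "-":                 # binary files report "-" for line counts
            lines += int(added) + int(deleted)
    return files, lines

if __name__ == "__main__":
    files, lines = change_size()
    if files > MAX_FILES or lines > MAX_LINES_CHANGED:
        sys.exit(f"Change too large for a single review: {files} files, "
                 f"{lines} lines. Split it into sequential changes.")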

Mandatory Human Review

Every agent-generated change must be reviewed by a human before merging to trunk. This is a non-negotiable constraint. The purpose is not to check the agent’s “work” in a supervisory sense - it is to verify that the change matches the intent and fits the system.

Comprehensive Test Coverage

Agent-generated code must have higher test coverage than the team’s baseline. If the team’s baseline is 80% coverage, agent-generated code should target 90%+. This compensates for the reduced human oversight of the implementation details.

Provenance Tracking

The pipeline must record which changes were agent-generated, which agent generated them, and what prompt or intent description was used. This supports audit, debugging, and learning.
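
A lightweight way to do this is to have the pipeline emit a structured provenance record per change - for example as a build artifact or a git note. The sketch below uses illustrative field names.

import json
from datetime import datetime, timezone

def provenance_record(commit_sha: str, agent: str, intent_path: str, reviewer: str) -> str:
    """Build an illustrative provenance record for an agent-generated change."""
    return json.dumps({
        "commit": commit_sha,
        "generated_by": agent,              # which agent or model produced the change
        "intent_description": intent_path,  # the human-authored intent it worked from
        "human_reviewer": reviewer,         # the mandatory review, recorded explicitly
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }, indent=2)

# Emitted by a pipeline step and stored alongside the build artifacts.
print(provenance_record("9f3c2ab", "codegen-agent-v3",
                        "docs/intents/rate-limiting.md", "a.developer"))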

Getting Started with Agentic CD

Before jumping into agentic workflows, ensure your team has the prerequisite delivery practices in place. The AI Adoption Roadmap provides a step-by-step sequence: quality tools, clear requirements, hardened guardrails, and reduced delivery friction - all before accelerating with AI coding.

Phase 1: Agent as Assistant

The agent helps human developers write code, but the human makes all decisions and commits all changes. The pipeline does not know or care about agent involvement.

This is where most teams are today. It requires no pipeline changes.

Phase 2: Agent as Contributor

The agent generates complete changes based on intent descriptions and executable truth. A human reviews and merges. The pipeline validates.

Requires: Explicit intent descriptions, test-first workflow, human review gate.

Phase 3: Agent as Autonomous Contributor

The agent generates, tests, and proposes changes with minimal human involvement. Human review is still mandatory, but the agent handles the full cycle from intent to implementation.

Requires: All six first-class artifacts, comprehensive system constraints, provenance tracking, and high confidence in the executable truth.

Key Pitfalls

1. “We let the agent generate tests and code together”

If the agent writes both the tests and the code, the tests may be designed to pass the code rather than to verify the intent. Tests must be written or reviewed by a human before the agent generates the implementation. This is the most important constraint in agentic CD.

2. “The agent generates changes faster than we can review them”

This is a feature, not a bug - but only if you have the discipline to not merge unreviewed changes. The agent’s speed should not pressure humans to review faster. WIP limits apply: if the review queue is full, the agent stops generating new changes.

3. “We trust the agent because it passed the tests”

Passing tests is necessary but not sufficient. Tests cannot verify intent, architectural fitness, or maintainability. Human review remains mandatory.

4. “We don’t track which changes are agent-generated”

Without provenance tracking, you cannot learn from agent-generated failures, audit agent behavior, or improve the agent’s constraints over time. Track provenance from the start.

Measuring Success

| Metric | Target | Why It Matters |
| --- | --- | --- |
| Agent-generated change failure rate | Equal to or lower than human-generated | Confirms agent changes meet the same quality bar |
| Review time for agent-generated changes | Comparable to human-generated changes | Confirms changes are reviewable, not rubber-stamped |
| Test coverage for agent-generated code | Higher than baseline | Confirms the additional coverage constraint is met |
| Agent-generated changes with complete artifacts | 100% | Confirms the six-artifact workflow is being followed |

Next Step

For real-world examples of teams that have made the full journey to continuous deployment, see Experience Reports.


This content is adapted from MinimumCD.org, licensed under CC BY 4.0.

4 - Experience Reports

Real-world stories from teams that have made the journey to continuous deployment.

Phase 4 - Deliver on Demand | Adapted from MinimumCD.org

Theory is necessary but insufficient. This page collects experience reports from organizations that have adopted continuous deployment at scale, including the challenges they faced, the approaches they took, and the results they achieved. These reports demonstrate that CD is not limited to startups or greenfield projects - it works in large, complex, regulated environments.

Why Experience Reports Matter

Every team considering continuous deployment faces the same objection: “That works for [Google / Netflix / small startups], but our situation is different.” Experience reports counter this objection with evidence. They show that organizations of every size, in every industry, with every kind of legacy system, have found a path to continuous deployment.

No experience report will match your situation exactly. That is not the point. The point is to extract patterns: what obstacles did these teams encounter, and how did they overcome them?

Walmart: CD at Retail Scale

Context

Walmart operates one of the world’s largest e-commerce platforms alongside its massive physical retail infrastructure. Changes to the platform affect millions of transactions per day. The organization had a traditional release process with weekly deployment windows and multi-stage manual approval.

The Challenge

  • Scale: Thousands of developers across hundreds of teams
  • Risk tolerance: Any outage affects revenue in real time
  • Legacy: Decades of existing systems with deep interdependencies
  • Regulation: PCI compliance requirements for payment processing

What They Did

  • Invested in a centralized deployment platform (OneOps, later Concord) that standardized the deployment pipeline across all teams
  • Broke the monolithic release into independent service deployments
  • Implemented automated canary analysis for every deployment
  • Moved from weekly release trains to on-demand deployment per team

Key Lessons

  1. Platform investment pays off. Building a shared deployment platform let hundreds of teams adopt CD without each team solving the same infrastructure problems.
  2. Compliance and CD are compatible. Automated pipelines with full audit trails satisfied PCI requirements more reliably than manual approval processes.
  3. Cultural change is harder than technical change. Teams that had operated on weekly release cycles for years needed coaching and support to trust automated deployment.

Microsoft: From Waterfall to Daily Deploys

Context

Microsoft’s Azure DevOps (formerly Visual Studio Team Services) team made a widely documented transformation from 3-year waterfall releases to deploying multiple times per day. This transformation happened within one of the largest software organizations in the world.

The Challenge

  • History: Decades of waterfall development culture
  • Product complexity: A platform used by millions of developers
  • Organizational size: Thousands of engineers across multiple time zones
  • Customer expectations: Enterprise customers expected stability and predictability

What They Did

  • Broke the product into independently deployable services (ring-based deployment)
  • Implemented a ring-based rollout: Ring 0 (team), Ring 1 (internal Microsoft users), Ring 2 (select external users), Ring 3 (all users)
  • Invested heavily in automated testing, building a suite of thousands of tests that runs in minutes
  • Moved from a fixed release cadence to continuous deployment with feature flags controlling release
  • Used telemetry to detect issues in real time and to trigger automated rollbacks when metrics degraded

Key Lessons

  1. Ring-based deployment is progressive rollout. Microsoft’s ring model is an implementation of the progressive rollout strategies described in this guide.
  2. Feature flags enabled decoupling. By deploying frequently but releasing features incrementally via flags, the team could deploy without worrying about feature completeness.
  3. The transformation took years, not months. Moving from 3-year cycles to daily deployment was a multi-year journey with incremental progress at each step.

Google: Engineering Productivity at Scale

Context

Google is often cited as the canonical example of continuous deployment, deploying changes to production thousands of times per day across its vast service portfolio.

The Challenge

  • Scale: Billions of users, millions of servers
  • Monorepo: Most of Google operates from a single repository with billions of lines of code
  • Interdependencies: Changes in shared libraries can affect thousands of services
  • Velocity: Thousands of engineers committing changes every day

What They Did

  • Built a culture of automated testing where tests are a first-class deliverable, not an afterthought
  • Implemented a submit queue that runs automated tests on every change before it merges to the trunk
  • Invested in build infrastructure (Blaze/Bazel) that can build and test only the affected portions of the codebase
  • Used percentage-based rollout for user-facing changes
  • Made rollback a one-click operation available to every team

Key Lessons

  1. Test infrastructure is critical infrastructure. Google’s ability to deploy frequently depends entirely on its ability to test quickly and reliably.
  2. Monorepo and CD are compatible. The common assumption that CD requires microservices with separate repos is false. Google deploys from a monorepo.
  3. Invest in tooling before process. Google built the tooling (build systems, test infrastructure, deployment automation) that made good practices the path of least resistance.

Amazon: Two-Pizza Teams and Ownership

Context

Amazon’s transformation to service-oriented architecture and team ownership is one of the most influential in the industry. The “two-pizza team” model and “you build it, you run it” philosophy directly enabled continuous deployment.

The Challenge

  • Organizational size: Hundreds of thousands of employees
  • System complexity: Thousands of services powering amazon.com and AWS
  • Availability requirements: Even brief outages are front-page news
  • Pace of innovation: Competitive pressure demands rapid feature delivery

What They Did

  • Decomposed the system into independently deployable services, each owned by a small team
  • Gave teams full ownership: build, test, deploy, operate, and support
  • Built internal deployment tooling (Apollo) that automates canary analysis, rollback, and one-click deployment
  • Established the practice of deploying every commit that passes the pipeline, with automated rollback on metric degradation

Key Lessons

  1. Ownership drives quality. When the team that writes the code also operates it in production, they write better code and build better monitoring.
  2. Small teams move faster. Two-pizza teams (6-10 people) can make decisions without bureaucratic overhead.
  3. Automation eliminates toil. Amazon’s internal deployment tooling means that deploying is not a skilled activity - any team member can deploy (and the pipeline usually deploys automatically).

HP: CD in Hardware-Adjacent Software

Context

HP’s LaserJet firmware team demonstrated that continuous delivery principles apply even to embedded software, a domain often considered incompatible with frequent deployment.

The Challenge

  • Embedded software: Firmware that runs on physical printers
  • Long development cycles: Firmware releases had traditionally been annual
  • Quality requirements: Firmware bugs require physical recalls or complex update procedures
  • Team size: Large, distributed teams with varying skill levels

What They Did

  • Invested in automated testing infrastructure for firmware
  • Reduced build times from days to under an hour
  • Moved from annual releases to frequent incremental updates
  • Implemented continuous integration with automated test suites running on simulator and hardware

Key Lessons

  1. CD principles are universal. Even embedded firmware can benefit from small batches, automated testing, and continuous integration.
  2. Build time is a critical constraint. Reducing build time from days to under an hour unlocked the ability to test frequently, which enabled frequent integration, which enabled frequent delivery.
  3. The results were dramatic. Development costs fell by approximately 40%, and programs delivered on schedule increased by roughly 140%.

Flickr: “10+ Deploys Per Day”

Context

Flickr’s 2009 presentation “10+ Deploys Per Day: Dev and Ops Cooperation” is credited with helping launch the DevOps movement. At a time when most organizations deployed quarterly, Flickr was deploying more than ten times per day.

The Challenge

  • Web-scale service: Serving billions of photos to millions of users
  • Ops/Dev divide: Traditional separation between development and operations teams
  • Fear of change: Deployments were infrequent because they were risky

What They Did

  • Built automated infrastructure provisioning and deployment
  • Implemented feature flags to decouple deployment from release
  • Created a culture of shared responsibility between development and operations
  • Made deployment a routine, low-ceremony event that anyone could trigger
  • Used IRC bots (and later chat-based tools) to coordinate and log deployments

Key Lessons

  1. Culture is the enabler. Flickr’s technical practices were important, but the cultural shift - developers and operations working together, shared responsibility, mutual respect - was what made frequent deployment possible.
  2. Tooling should reduce friction. Flickr’s deployment tools were designed to make deploying as easy as possible. The easier it is to deploy, the more often people deploy, and the smaller each deployment becomes.
  3. Transparency builds trust. Logging every deployment in a shared channel let everyone see what was deploying, who deployed it, and whether it caused problems. This transparency built organizational trust in frequent deployment.

Common Patterns Across Reports

Despite the diversity of these organizations, several patterns emerge consistently:

1. Investment in Automation Precedes Cultural Change

Every organization built the tooling first. Automated testing, automated deployment, automated rollback - these created the conditions where frequent deployment was possible. Cultural change followed when people saw that the automation worked.

2. Incremental Adoption, Not Big Bang

No organization switched to continuous deployment overnight. They all moved incrementally: shorter release cycles first, then weekly deploys, then daily, then on-demand. Each step built confidence for the next.

3. Team Ownership Is Essential

Organizations that gave teams ownership of their deployments (build it, run it) moved faster than those that kept deployment as a centralized function. Ownership creates accountability, which drives quality.

4. Feature Flags Are Universal

Every organization in these reports uses feature flags to decouple deployment from release. This is not optional for continuous deployment - it is foundational.

5. The Results Are Consistent

Regardless of industry, size, or starting point, organizations that adopt continuous deployment consistently report:

  • Higher deployment frequency (daily or more)
  • Lower change failure rate (small changes fail less)
  • Faster recovery (automated rollback, small blast radius)
  • Higher developer satisfaction (less toil, more impact)
  • Better business outcomes (faster time to market, reduced costs)

Applying These Lessons to Your Migration

You do not need to be Google-sized to benefit from these patterns. Extract what applies:

  1. Start with automation. Build the pipeline, the tests, the rollback mechanism.
  2. Adopt incrementally. Move from monthly to weekly to daily. Do not try to jump to 10 deploys per day on day one.
  3. Give teams ownership. Let teams deploy their own services.
  4. Use feature flags. Decouple deployment from release.
  5. Measure and improve. Track DORA metrics. Run experiments. Use retrospectives.

These are the practices covered throughout this migration guide. The experience reports confirm that they work - not in theory, but in production, at scale, in the real world.

Further Reading

For detailed experience reports and additional case studies, see:

  • MinimumCD.org Experience Reports - Collected reports from organizations practicing minimum CD
  • Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim - The research behind DORA metrics, with extensive case study data
  • Continuous Delivery by Jez Humble and David Farley - The foundational text, with detailed examples from multiple organizations
  • The DevOps Handbook by Gene Kim, Jez Humble, Patrick Debois, and John Willis - Case studies from organizations across industries

This content is adapted from MinimumCD.org, licensed under CC BY 4.0.