Testing

Anti-patterns in test strategy, test architecture, and quality practices that block continuous delivery.

These anti-patterns affect how teams build confidence that their code is safe to deploy. They create slow pipelines, flaky feedback, and manual gates that prevent the continuous flow of changes to production.

1 - No Test Automation

Zero automated tests. The team has no idea where to start and the codebase was not designed for testability.

Category: Testing & Quality | Quality Impact: Critical

What This Looks Like

The team deploys by manually verifying things work. Someone clicks through the application, checks a few screens, and declares it good. There is no test suite. No test runner configured. No test directory in the repository. The CI server, if one exists, builds the code and stops there.

When a developer asks “how do I know if my change broke something?” the answer is either “you don’t” or “someone from QA will check it.” Bugs discovered in production are treated as inevitable. Nobody connects the lack of automated tests to the frequency of production incidents because there is no baseline to compare against.

Common variations:

  • Tests exist but are never run. Someone wrote tests a year ago. The test suite is broken and nobody has fixed it. The tests are checked into the repository but are not part of any pipeline or workflow.
  • Manual test scripts as the safety net. A spreadsheet or wiki page lists hundreds of manual test cases. Before each release, someone walks through them by hand. The process takes days. It is the only verification the team has.
  • Testing is someone else’s job. Developers write code. A separate QA team tests it days or weeks later. The feedback loop is so long that developers have moved on to other work by the time defects are found.
  • “The code is too legacy to test.” The team has decided the codebase is untestable. Functions are thousands of lines long, everything depends on global state, and there are no seams where test doubles could be inserted. This belief becomes self-fulfilling - nobody tries because everyone agrees it is impossible.

The telltale sign: when a developer makes a change, the only way to verify it works is to deploy it and see what happens.

Why This Is a Problem

Without automated tests, every change is a leap of faith. The team has no fast, reliable way to know whether code works before it reaches users. Every downstream practice that depends on confidence in the code - continuous integration, automated deployment, frequent releases - is blocked.

It reduces quality

When there are no automated tests, defects are caught by humans or by users. Humans are slow, inconsistent, and unable to check everything. A manual tester cannot verify 500 behaviors in an hour, but an automated suite can. The behaviors that are not checked are the ones that break.

Developers writing code without tests have no feedback on whether their logic is correct until someone else exercises it. A function that handles an edge case incorrectly will not be caught until a user hits that edge case in production. By then, the developer has moved on and lost context on the code they wrote.

With even a basic suite of automated tests, developers get feedback in minutes. They catch their own mistakes while the code is fresh. The suite runs the same checks every time, never forgetting an edge case and never getting tired.

It increases rework

Without tests, rework comes from two directions. First, bugs that reach production must be investigated, diagnosed, and fixed - work that an automated test would have prevented. Second, developers are afraid to change existing code because they have no way to verify they have not broken something. This fear leads to workarounds: copy-pasting code instead of refactoring, adding conditional branches instead of restructuring, and building new modules alongside old ones instead of modifying what exists.

Over time, the codebase becomes a patchwork of workarounds layered on workarounds. Each change takes longer because the code is harder to understand and more fragile. The absence of tests is not just a testing problem - it is a design problem that compounds with every change.

Teams with automated tests refactor confidently. They rename functions, extract modules, and simplify logic knowing that the test suite will catch regressions. The codebase stays clean because changing it is safe.

It makes delivery timelines unpredictable

Without automated tests, the time between “code complete” and “deployed” is dominated by manual verification. How long that verification takes depends on how many changes are in the batch, how available the testers are, and how many defects they find. None of these variables are predictable.

A change that a developer finishes on Monday might not be verified until Thursday. If defects are found, the cycle restarts. Lead time from commit to production is measured in weeks, and the variance is enormous. Some changes take three days, others take three weeks, and the team cannot predict which.

Automated tests collapse the verification step to minutes. The time from “code complete” to “verified” becomes a constant, not a variable. Lead time becomes predictable because the largest source of variance has been removed.

Impact on continuous delivery

Automated tests are the foundation of continuous delivery. Without them, there is no automated quality gate. Without an automated quality gate, there is no safe way to deploy frequently. Without frequent deployment, there is no fast feedback from production. Every CD practice assumes that the team can verify code quality automatically. A team with no test automation is not on a slow path to CD - they have not started.

How to Fix It

Starting test automation on an untested codebase feels overwhelming. The key is to start small, establish the habit, and expand coverage incrementally. You do not need to test everything before you get value - you need to test something and keep going.

Step 1: Set up the test infrastructure (Week 1)

Before writing a single test, make it trivially easy to run tests:

  1. Choose a test framework for your primary language. Pick the most popular one - do not deliberate.
  2. Add the framework to the project. Configure it. Write a single test that asserts true == true and verify it passes.
  3. Add a test script or command to the project so that anyone can run the suite with a single command (e.g., npm test, pytest, mvn test).
  4. Add the test command to the CI pipeline so that tests run on every push.

The goal for week one is not coverage. It is infrastructure: a working test runner in the pipeline that the team can build on.
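
If the team picks pytest (used here purely as an example - any mainstream framework follows the same shape), the entire week-one deliverable can be this small:

    # tests/test_sanity.py
    # A deliberately trivial first test. Its only job is to prove that the test
    # runner is installed, discovers the tests directory, and reports a result
    # both locally and in CI.
    def test_the_test_runner_runs():
        assert True

Running the single command (pytest here) locally and then in the pipeline proves the infrastructure works end to end.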

Step 2: Write tests for every new change (Week 2+)

Establish a team rule: every new change must include at least one automated test. Not “every new feature” - every change. Bug fixes get a regression test that fails without the fix and passes with it. New functions get a test that verifies the core behavior. Refactoring gets a test that pins the existing behavior before changing it.

This rule is more important than retroactive coverage. New code enters the codebase tested. The tested portion grows with every commit. After a few months, the most actively changed code has coverage, which is exactly where coverage matters most.
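
As a sketch of the rule in practice: a bug fix for, say, an order total that went negative when a discount exceeded the subtotal would land together with a regression test like the one below. The module and function names (billing, calculate_total) are hypothetical stand-ins for whatever code the fix touches.

    # tests/test_billing_regression.py
    from billing import calculate_total  # hypothetical module under test

    def test_total_is_clamped_at_zero_when_discount_exceeds_subtotal():
        # Fails without the fix, passes with it - the regression test the rule requires.
        assert calculate_total(subtotal=10.00, discount=15.00) == 0.00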

Step 3: Target high-change areas for retroactive coverage (Weeks 3-6)

Use your version control history to find the files that change most often. These are the files where bugs are most likely and where tests provide the most value:

  1. List the 10 files with the most commits in the last six months.
  2. For each file, write tests for its core public behavior. Do not try to test every line - test the functions that other code depends on.
  3. If the code is hard to test because of tight coupling, wrap it. Create a thin adapter around the untestable code and test the adapter. This is the Strangler Fig pattern applied to testing.
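
One way to produce the list in step 1, sketched as a small script that shells out to git (run it from the repository root; adjust the window and the top-N to taste):

    # churn.py - rank files by commit count over the last six months.
    import subprocess
    from collections import Counter

    log = subprocess.run(
        ["git", "log", "--since=6 months ago", "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    ).stdout

    # Each commit contributes one line per changed file; count and rank them.
    counts = Counter(line for line in log.splitlines() if line.strip())
    for path, commits in counts.most_common(10):
        print(f"{commits:4d}  {path}")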

Step 4: Make untestable code testable incrementally (Weeks 4-8)

If the codebase resists testing, introduce seams one at a time:

Problem | Technique
Function does too many things | Extract the pure logic into a separate function and test that
Hard-coded database calls | Introduce a repository interface, inject it, test with a fake
Global state or singletons | Pass dependencies as parameters instead of accessing globals
No dependency injection | Start with “poor man’s DI” - default parameters that can be overridden in tests

You do not need to refactor the entire codebase. Each time you touch a file, leave it slightly more testable than you found it.
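
A minimal sketch of the first two techniques in the table, using hypothetical names (Account, apply_late_fee): the fee rule becomes a pure function, and the data access is passed in so a fake can stand in for the database during tests.

    from dataclasses import dataclass

    def late_fee(days_overdue: int) -> float:
        # Seam 1: the fee rule extracted into a pure function - testable in isolation.
        return 0.0 if days_overdue <= 0 else min(days_overdue * 0.50, 10.00)

    @dataclass
    class Account:
        days_overdue: int
        balance: float

    class FakeAccountRepository:
        # Seam 2: an injected repository. Production code passes a real
        # database-backed implementation; tests pass this fast, deterministic fake.
        def __init__(self, account):
            self.account = account
        def load(self, account_id):
            return self.account
        def save(self, account):
            self.account = account

    def apply_late_fee(account_id, repository):
        account = repository.load(account_id)
        account.balance += late_fee(account.days_overdue)
        repository.save(account)

    def test_late_fee_is_added_to_the_balance():
        repo = FakeAccountRepository(Account(days_overdue=4, balance=20.00))
        apply_late_fee("acct-1", repository=repo)
        assert repo.account.balance == 22.00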

Step 5: Set a coverage floor and ratchet it up (Week 6+)

Once you have meaningful coverage in actively changed code, set a coverage threshold in the pipeline:

  1. Measure current coverage. Say it is 15%.
  2. Set the pipeline to fail if coverage drops below 15%.
  3. Every two weeks, raise the floor by 2-5 percentage points.

The floor prevents backsliding. The ratchet ensures progress. The team does not need to hit 90% coverage - they need to ensure that coverage only goes up.
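
The ratchet can be enforced with your coverage tool's built-in threshold option where one exists (pytest-cov's --cov-fail-under, for example), or with a small script like the sketch below. It assumes a Cobertura-style coverage.xml (what coverage.py's coverage xml command emits) and a one-line coverage_floor.txt that the team edits upward every two weeks.

    # coverage_ratchet.py - fail the build if coverage drops below the recorded floor.
    import sys
    import xml.etree.ElementTree as ET

    floor = float(open("coverage_floor.txt").read().strip())   # e.g. "15"
    line_rate = float(ET.parse("coverage.xml").getroot().get("line-rate"))
    covered = line_rate * 100

    if covered < floor:
        sys.exit(f"Coverage {covered:.1f}% is below the floor of {floor:.1f}%")
    print(f"Coverage {covered:.1f}% meets the floor of {floor:.1f}%")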

Objection | Response
“The codebase is too legacy to test” | You do not need to test the legacy code directly. Wrap it in testable adapters and test those. Every new change gets a test. Coverage grows from the edges inward.
“We don’t have time to write tests” | You are already spending that time on manual verification and production debugging. Tests shift that cost to the left where it is cheaper. Start with one test per change - the overhead is minutes, not hours.
“We need to test everything before it’s useful” | One test that catches one regression is more useful than zero tests. The value is immediate and cumulative. You do not need full coverage to start getting value.
“Developers don’t know how to write tests” | Pair a developer who has testing experience with one who does not. If nobody on the team has experience, invest one day in a testing workshop. The skill is learnable in a week.

Measuring Progress

Metric | What to look for
Test count | Should increase every sprint
Code coverage of actively changed files | More meaningful than overall coverage - focus on files changed in the last 30 days
Build duration | Should increase slightly as tests are added, but stay under 10 minutes
Defects found in production vs. in tests | Ratio should shift toward tests over time
Change fail rate | Should decrease as test coverage catches regressions before deployment
Manual testing effort per release | Should decrease as automated tests replace manual verification

2 - Manual Regression Testing Gates

Every release requires days or weeks of manual testing. Testers execute scripted test cases. Test effort scales linearly with application size.

Category: Testing & Quality | Quality Impact: Critical

What This Looks Like

Before every release, the team enters a testing phase. Testers open a spreadsheet or test management tool containing hundreds of scripted test cases. They walk through each one by hand: click this button, enter this value, verify this result. The testing takes days. Sometimes it takes weeks. Nothing ships until every case is marked pass or fail, and every failure is triaged.

Developers stop working on new features during this phase because testers need a stable build to test against. Code freezes go into effect. Bug fixes discovered during testing must be applied carefully to avoid invalidating tests that have already passed. The team enters a holding pattern where the only work that matters is getting through the test cases.

The testing effort grows with every release. New features add new test cases, but old test cases are rarely removed because nobody is confident they are redundant. A team that tested for three days six months ago now tests for five. The spreadsheet has 800 rows. Every release takes longer to validate than the last.

Common variations:

  • The regression spreadsheet. A master spreadsheet of every test case the team has ever written. Before each release, a tester works through every row. The spreadsheet is the institutional memory of what the software is supposed to do, and nobody trusts anything else.
  • The dedicated test phase. The sprint cadence is two weeks of development followed by one week of testing. The test week is a mini-waterfall phase embedded in an otherwise agile process. Nothing can ship until the test phase is complete.
  • The test environment bottleneck. Manual testing requires a specific environment that is shared across teams. The team must wait for their slot. When the environment is broken by another team’s testing, everyone waits for it to be restored.
  • The sign-off ceremony. A QA lead or manager must personally verify a subset of critical paths and sign a document before the release can proceed. If that person is on vacation, the release waits.
  • The compliance-driven test cycle. Regulatory requirements are interpreted as requiring manual execution of every test case with documented evidence. Each test run produces screenshots and sign-off forms. The documentation takes as long as the testing itself.

The telltale sign: if the question “can we release today?” is always answered with “not until QA finishes,” manual regression testing is gating your delivery.

Why This Is a Problem

Manual regression testing feels responsible. It feels thorough. But it creates a bottleneck that grows worse with every feature the team builds, and the thoroughness it promises is an illusion.

It reduces quality

Manual testing is less reliable than it appears. A human executing the same test case for the hundredth time will miss things. Attention drifts. Steps get skipped. Edge cases that seemed important when the test was written get glossed over when the tester is on row 600 of a spreadsheet. Studies on manual testing consistently show that testers miss 15-30% of defects that are present in the software they are testing.

The test cases themselves decay. They were written for the version of the software that existed when the feature shipped. As the product evolves, some cases become irrelevant, others become incomplete, and nobody updates them systematically. The team is executing a test plan that partially describes software that no longer exists.

The feedback delay compounds the quality problem. A developer who wrote code two weeks ago gets a bug report from a tester during the regression cycle. The developer has lost context on the change. They re-read their own code, try to remember what they were thinking, and fix the bug with less confidence than they would have had the day they wrote it.

Automated tests catch the same classes of bugs in seconds, with perfect consistency, every time the code changes. They do not get tired on row 600. They do not skip steps. They run against the current version of the software, not a test plan written six months ago. And they give feedback immediately, while the developer still has full context.

It increases rework

The manual testing gate creates a batch-and-queue cycle. Developers write code for two weeks, then testers spend a week finding bugs in that code. Every bug found during the regression cycle is rework: the developer must stop what they are doing, reload the context of a completed story, diagnose the issue, fix it, and send it back to the tester for re-verification. The re-verification may invalidate other test cases, requiring additional re-testing.

The batch size amplifies the rework. When two weeks of changes are tested together, a bug could be in any of dozens of commits. Narrowing down the cause takes longer because there are more variables. Had the same bug been caught by an automated test minutes after it was introduced, the developer would have fixed it in the same sitting - one context switch instead of many.

The rework also affects testers. A bug fix during the regression cycle means the tester must re-run affected test cases. If the fix changes behavior elsewhere, the tester must re-run those cases too. A single bug fix can cascade into hours of re-testing, pushing the release date further out.

With automated regression tests, bugs are caught as they are introduced. The fix happens immediately. There is no regression cycle, no re-testing cascade, and no context-switching penalty.

It makes delivery timelines unpredictable

The regression testing phase takes as long as it takes. The team cannot predict how many bugs the testers will find, how long each fix will take, or how much re-testing the fixes will require. A release planned for Friday might slip to the following Wednesday. Or the following Friday.

This unpredictability cascades through the organization. Product managers cannot commit to delivery dates because they do not know how long testing will take. Stakeholders learn to pad their expectations. “We’ll release in two weeks” really means “we’ll release in two to four weeks, depending on what QA finds.”

The unpredictability also creates pressure to cut corners. When the release is already three days late, the team faces a choice: re-test thoroughly after a late bug fix, or ship without full re-testing. Under deadline pressure, most teams choose the latter. The manual testing gate that was supposed to ensure quality becomes the reason quality is compromised.

Automated regression suites produce predictable, repeatable results. The suite runs in the same amount of time every time. There is no testing phase to slip. The team knows within minutes of every commit whether the software is releasable.

It creates a permanent scaling problem

Manual testing effort scales linearly with application size. Every new feature adds test cases. The test suite never shrinks. A team that takes three days to test today will take four days in six months and five days in a year. The testing phase consumes an ever-growing fraction of the team’s capacity.

This scaling problem is invisible at first. Three days of testing feels manageable. But the growth is relentless. The team that started with 200 test cases now has 800. The test phase that was two days is now a week. And because the test cases were written by different people at different times, nobody can confidently remove any of them without risking a missed regression.

Automated tests scale differently. Adding a new automated test adds milliseconds to the suite duration, not hours to the testing phase. Run in parallel, a suite of 10,000 automated tests can finish in roughly the same 10 minutes as a suite of 1,000. The cost of confidence is fixed, not linear.

Impact on continuous delivery

Manual regression testing is fundamentally incompatible with continuous delivery. CD requires that any commit can be released at any time. A manual testing gate that takes days means the team can release at most once per testing cycle. If the gate takes a week, the team releases at most every two or three weeks - regardless of how fast their pipeline is or how small their changes are.

The manual gate also breaks the feedback loop that CD depends on. CD gives developers confidence that their change works by running automated checks within minutes. A manual gate replaces that fast feedback with a slow, batched, human process that cannot keep up with the pace of development.

You cannot have continuous delivery with a manual regression gate. The two are mutually exclusive. The gate must be automated before CD is possible.

How to Fix It

Step 1: Catalog your manual test cases and categorize them (Week 1)

Before automating anything, understand what the manual test suite actually covers. For every test case in the regression suite:

  1. Identify what behavior it verifies.
  2. Classify it: is it testing business logic, a UI flow, an integration boundary, or a compliance requirement?
  3. Rate its value: has this test ever caught a real bug? When was the last time?
  4. Rate its automation potential: can this be tested at a lower level (unit, functional, API)?

Most teams discover that a large percentage of their manual test cases are either redundant (the same behavior is tested multiple times), outdated (the feature has changed), or automatable at a lower level.

Step 2: Automate the highest-value cases first (Weeks 2-4)

Pick the 20 test cases that cover the most critical paths - the ones that would cause the most damage if they regressed. Automate them:

  • Business logic tests become unit tests.
  • API behavior tests become functional tests.
  • Critical user journeys become a small set of E2E smoke tests.

Do not try to automate everything at once. Start with the cases that give the most confidence per minute of execution time. The goal is to build a fast automated suite that covers the riskiest scenarios so the team no longer depends on manual execution for those paths.

Step 3: Run automated tests in the pipeline on every commit (Week 3)

Move the new automated tests into the CI pipeline so they run on every push. This is the critical shift: testing moves from a phase at the end of development to a continuous activity that happens with every change.

Every commit now gets immediate feedback on the critical paths. If a regression is introduced, the developer knows within minutes - not weeks.

Step 4: Shrink the manual suite as automation grows (Weeks 4-8)

Each week, pick another batch of manual test cases and either automate or retire them:

  • Automate cases where the behavior is stable and testable at a lower level.
  • Retire cases that are redundant with existing automated tests or that test behavior that no longer exists.
  • Keep manual only for genuinely exploratory testing that requires human judgment - usability evaluation, visual design review, or complex workflows that resist automation.

Track the shrinkage. If the manual suite had 800 cases and now has 400, that is progress. If the manual testing phase took five days and now takes two, that is measurable improvement.

Step 5: Replace the testing phase with continuous testing (Weeks 6-8+)

The goal is to eliminate the dedicated testing phase entirely:

Before | After
Code freeze before testing | No code freeze - trunk is always testable
Testers execute scripted cases | Automated suite runs on every commit
Bugs found days or weeks after coding | Bugs found minutes after coding
Testing phase blocks release | Release readiness checked automatically
QA sign-off required | Pipeline pass is the sign-off
Testers do manual regression | Testers do exploratory testing, write automated tests, and improve test infrastructure

Step 6: Address the objections (Ongoing)

Objection | Response
“Automated tests can’t catch everything a human can” | Correct. But humans cannot execute 800 test cases reliably in a day, and automated tests can. Automate the repeatable checks and free humans for the exploratory testing where their judgment adds value.
“We need manual testing for compliance” | Most compliance frameworks require evidence that testing was performed, not that humans performed it. Automated test reports with pass/fail results, timestamps, and traceability to requirements satisfy most audit requirements better than manual spreadsheets. Confirm with your compliance team.
“Our testers don’t know how to write automated tests” | Pair testers with developers. The tester contributes domain knowledge - what to test and why - while the developer contributes automation skills. Over time, the tester learns automation and the developer learns testing strategy.
“We can’t automate tests for our legacy system” | Start with new code. Every new feature gets automated tests. For legacy code, automate the most critical paths first and expand coverage as you touch each area. The legacy system does not need 100% automation overnight.
“What if we automate a test wrong and miss a real bug?” | Manual tests miss real bugs too - consistently. An automated test that is wrong can be fixed once and stays fixed. A manual tester who skips a step makes the same mistake next time. Automation is not perfect, but it is more reliable and more improvable than manual execution.

Measuring Progress

Metric | What to look for
Manual test case count | Should decrease steadily as cases are automated or retired
Manual testing phase duration | Should shrink toward zero
Automated test count in pipeline | Should grow as manual cases are converted
Release frequency | Should increase as the manual gate shrinks
Development cycle time | Should decrease as the testing phase is eliminated
Time from code complete to release | Should converge toward pipeline duration, not testing phase duration

3 - Flaky Test Suites

Tests randomly pass or fail. Developers rerun the pipeline until it goes green. Nobody trusts the test suite to tell them anything useful.

Category: Testing & Quality | Quality Impact: High

What This Looks Like

A developer pushes a change. The pipeline fails. They look at the failure - it is a test they did not touch, in a module they did not change. They click “rerun.” It passes. They merge.

This happens multiple times a day across the team. Nobody investigates failures on the first occurrence because the odds favor flakiness over a real problem. When someone mentions a test failure in standup, the first question is “did you rerun it?” not “what broke?”

Common variations:

  • The nightly lottery. The full suite runs overnight. Every morning, a different random subset of tests is red. Someone triages the failures, marks most as flaky, and the team moves on. Real regressions hide in the noise.
  • The retry-until-green pattern. The pipeline configuration automatically reruns failed tests two or three times. If a test passes on any attempt, it counts as passed. The team considers this solved. In reality, it masks failures and doubles or triples pipeline duration.
  • The “known flaky” tag. Tests are annotated with a skip or known-flaky marker. The suite ignores them. The list grows over time. Nobody goes back to fix them because they are out of sight.
  • Environment-dependent failures. Tests pass on developer machines but fail in CI, or pass in CI but fail on Tuesdays. The failures correlate with shared test environments, time-of-day load patterns, or external service availability.
  • Test order dependency. Tests pass when run in a specific order but fail when run in isolation or in a different sequence. Shared mutable state from one test leaks into another.

The telltale sign: the team has a shared understanding that the first pipeline failure “doesn’t count.” Rerunning the pipeline is a routine step, not an exception.

Why This Is a Problem

Flaky tests are not a minor annoyance. They systematically destroy the value of the test suite by making it impossible to distinguish signal from noise. A test suite that sometimes lies is worse than no test suite at all, because it creates an illusion of safety.

It reduces quality

When tests fail randomly, developers stop trusting them. The rational response to a flaky suite is to ignore failures - and that is exactly what happens. A developer whose pipeline fails three times a week for reasons unrelated to their code learns to click “rerun” without reading the error message.

This behavior is invisible most of the time. It becomes catastrophic when a real regression happens. The test that catches the regression fails, the developer reruns because “it’s probably flaky,” it passes on the second run because the flaky behavior went the other way, and the regression ships to production. The test did its job, but the developer’s trained behavior neutralized it.

In a suite with zero flaky tests, every failure demands investigation. Developers read the error, find the cause, and fix it. Failures are rare and meaningful. The suite functions as a reliable quality gate.

It increases rework

Flaky tests cause rework in two ways. First, developers spend time investigating failures that turn out to be noise. A developer sees a test failure, spends 20 minutes reading the error and reviewing their change, realizes the failure is unrelated, and reruns. Multiply this by every developer on the team, multiple times per day.

Second, the retry-until-green pattern extends pipeline duration. A pipeline that should take 8 minutes takes 20 because failed tests are rerun automatically. Developers wait longer for feedback, switch to other work while they wait, and lose more time reloading context when the result finally arrives.

Teams with deterministic test suites waste zero time investigating flaky failures. Their pipeline runs once, gives an answer, and the developer acts on it.

It makes delivery timelines unpredictable

A flaky suite introduces randomness into the delivery process. The same code, submitted twice, might pass the pipeline on the first attempt or take three reruns. Lead time from commit to merge varies not because of code quality but because of test noise.

When the team needs to ship urgently, flaky tests become a source of anxiety. “Will the pipeline pass this time?” The team starts planning around the flakiness - running the pipeline early “in case it fails,” avoiding changes late in the day because there might not be time for reruns. The delivery process is shaped by the unreliability of the tests rather than by the quality of the code.

Deterministic tests make delivery time a function of code quality alone. The pipeline is a predictable step that takes the same amount of time every run. There are no surprises.

It normalizes ignoring failures

The most damaging effect of flaky tests is cultural. Once a team accepts that test failures are often noise, the standard for investigating failures drops permanently. New team members learn from day one that “you just rerun it.” The bar for adding a flaky test to the suite is low because one more flaky test is barely noticeable when there are already dozens.

This normalization extends beyond tests. If the team tolerates unreliable automated checks, they will tolerate unreliable monitoring, unreliable alerts, and unreliable deploys. Flaky tests teach the team that automation is not trustworthy - exactly the opposite of what CD requires.

Impact on continuous delivery

Continuous delivery depends on automated quality gates that the team trusts completely. A flaky suite is a quality gate with a broken lock - it looks like it is there, but it does not actually stop anything. Developers bypass it by rerunning. Regressions pass through it by luck.

The pipeline must be a machine that answers one question with certainty: “Is this change safe to deploy?” A flaky suite answers “probably, maybe, rerun and ask again.” That is not a foundation you can build continuous delivery on.

How to Fix It

Step 1: Measure the flakiness (Week 1)

Before fixing anything, quantify the problem:

  1. Collect pipeline run data for the last 30 days. Count the number of runs that failed and were rerun without code changes.
  2. Identify which specific tests failed across those reruns. Rank them by failure frequency.
  3. Calculate the pipeline pass rate: what percentage of first-attempt runs succeed?

This gives you a hit list and a baseline. If your first-attempt pass rate is 60%, then 40% of runs fail on the first attempt - and the rerun data tells you how much of that failure rate is noise rather than real regressions.
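
Most CI systems can export run history in some form; the sketch below assumes a CSV export with commit_sha, attempt, status, and failed_tests columns (that schema is an assumption - adapt it to whatever your CI provides). It reports the first-attempt pass rate and ranks the tests that failed on a first attempt but passed on a rerun of the same commit.

    # flakiness_report.py
    import csv
    from collections import Counter, defaultdict

    runs_by_commit = defaultdict(list)
    with open("pipeline_runs.csv", newline="") as f:
        for row in csv.DictReader(f):
            runs_by_commit[row["commit_sha"]].append(row)

    first_attempts = first_passes = 0
    flaky_suspects = Counter()

    for runs in runs_by_commit.values():
        runs.sort(key=lambda r: int(r["attempt"]))
        first = runs[0]
        first_attempts += 1
        if first["status"] == "passed":
            first_passes += 1
        elif any(r["status"] == "passed" for r in runs[1:]):
            # Failed, then passed with no code change: the tests that failed on
            # the first attempt are the flaky suspects.
            for test in filter(None, first["failed_tests"].split(";")):
                flaky_suspects[test] += 1

    print(f"First-attempt pass rate: {first_passes / first_attempts:.0%}")
    for test, count in flaky_suspects.most_common(10):
        print(f"{count:3d}  {test}")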

Step 2: Quarantine the worst offenders (Week 1)

Take the top 10 flakiest tests and move them out of the pipeline-gating suite immediately. Do not fix them yet - just remove them from the critical path.

  • Move them to a separate test suite that runs on a schedule (nightly or hourly) but does not block merges.
  • Create a tracking issue for each quarantined test with its failure rate and the suspected cause.

This immediately improves pipeline reliability. The team sees fewer false failures on day one.
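
If the suite runs under pytest (an assumption - most frameworks have an equivalent tagging mechanism), quarantining can be a marker plus a marker expression in the gating job:

    # Register the marker once in pytest.ini so it does not trigger warnings:
    #
    #   [pytest]
    #   markers =
    #       quarantine: known-flaky test, excluded from the gating suite
    #
    import pytest

    @pytest.mark.quarantine  # tracking issue: FLAKY-123 (hypothetical reference)
    def test_checkout_updates_inventory_after_payment():
        ...

    # Gating pipeline:              pytest -m "not quarantine"
    # Scheduled, non-blocking run:  pytest -m "quarantine"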

Step 3: Fix or replace quarantined tests (Weeks 2-4)

Work through the quarantined tests systematically. For each one, identify the root cause:

Root cause | Fix
Shared mutable state (database, filesystem, cache) | Isolate test data. Each test creates and destroys its own state. Use transactions or test containers.
Timing dependencies (sleep, setTimeout, polling) | Replace time-based waits with event-based waits. Wait for a condition, not a duration.
Test order dependency | Ensure each test is self-contained. Run tests in random order to surface hidden dependencies.
External service dependency | Replace with a test double. Validate the double with a contract test.
Race conditions in async code | Use deterministic test patterns. Await promises. Avoid fire-and-forget in test code.
Resource contention (ports, files, shared environments) | Allocate unique resources per test. Use random ports. Use temp directories.

For each quarantined test, either fix it and return it to the gating suite or replace it with a deterministic lower-level test that covers the same behavior.
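
The most common fix is the timing-dependency row of the table above: replacing a fixed sleep with a wait on the actual condition. A minimal sketch of such a helper (most test frameworks and browser drivers ship a built-in equivalent worth preferring):

    import time

    def wait_for(condition, timeout=5.0, interval=0.05):
        """Poll until condition() is truthy or the timeout expires."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if condition():
                return True
            time.sleep(interval)
        raise AssertionError(f"condition not met within {timeout}s")

    # Instead of:  time.sleep(2); assert outbox.is_empty()
    # Write:       wait_for(outbox.is_empty)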

Step 4: Prevent new flaky tests from entering the suite (Week 3+)

Establish guardrails so the problem does not recur:

  1. Run new tests 10 times in CI before merging them (see the sketch after this list). If any run fails, the test is flaky and must be fixed before it enters the suite.
  2. Run the full suite in random order. This surfaces order-dependent tests immediately.
  3. Track the pipeline first-attempt pass rate as a team metric. Make it visible on a dashboard. Set a target (e.g., 95%) and treat drops below the target as incidents.
  4. Add a team working agreement: flaky tests are treated as bugs with the same priority as production defects.
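
For item 1, a low-tech option under pytest (an assumption) is to parametrize the new test over a dummy range so it executes ten times in the merge pipeline; plugins such as pytest-repeat (--count) and, for item 2, pytest-randomly (random ordering) do the same jobs with less ceremony.

    import pytest

    @pytest.mark.parametrize("attempt", range(10))
    def test_new_report_export_is_deterministic(attempt):
        # export_report is a hypothetical function under test.
        result = export_report(order_ids=[1, 2, 3])
        assert result.row_count == 3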

Step 5: Eliminate automatic retries (Week 4+)

If the pipeline is configured to automatically retry failed tests, turn it off. Retries mask flakiness instead of surfacing it. Once the quarantine and prevention steps are in place, the suite should be reliable enough to run once.

If a test fails, it should mean something. Retries teach the team that failures are meaningless.

Objection | Response
“Retries are fine - they handle transient issues” | Transient issues in a test suite are a symptom of external dependencies or shared state. Fix the root cause instead of papering over it with retries.
“We don’t have time to fix flaky tests” | Calculate the time the team spends rerunning pipelines and investigating false failures. It is almost always more than the time to fix the flaky tests.
“Some flakiness is inevitable with E2E tests” | That is an argument for fewer E2E tests, not for tolerating flakiness. Push the test down to a level where it can be deterministic.
“The flaky test sometimes catches real bugs” | A test that catches real bugs 5% of the time and false-alarms 20% of the time is a net negative. Replace it with a deterministic test that catches the same bugs 100% of the time.

Measuring Progress

Metric | What to look for
Pipeline first-attempt pass rate | Should climb toward 95%+
Number of quarantined tests | Should decrease to zero as tests are fixed or replaced
Pipeline reruns per week | Should drop to near zero
Build duration | Should decrease as retries are removed
Development cycle time | Should decrease as developers stop waiting for reruns
Developer trust survey | Ask quarterly: “Do you trust the test suite to catch real problems?” Answers should improve.

4 - Inverted Test Pyramid

Most tests are slow end-to-end or UI tests. Few unit tests. The test suite is slow, brittle, and expensive to maintain.

Category: Testing & Quality | Quality Impact: High

What This Looks Like

The team has tests, but the wrong kind. Running the full suite takes 30 minutes or more. Tests fail randomly. Developers rerun the pipeline and hope for green. When a test fails, the first question is “is that a real failure or a flaky test?” rather than “what did I break?”

Common variations:

  • The ice cream cone. Most testing is manual. Below that, a large suite of end-to-end browser tests. A handful of integration tests. Almost no unit tests. The manual testing takes days, the E2E suite takes hours, and nothing runs fast enough to give developers feedback while they code.
  • The E2E-first approach. The team believes end-to-end tests are “real” tests because they test the “whole system.” Unit tests are dismissed as “not testing anything useful” because they use mocks. The result is a suite of 500 Selenium tests that take 45 minutes and fail 10% of the time.
  • The integration test swamp. Every test boots a real database, calls real services, and depends on shared test environments. Tests are slow because they set up and tear down heavy infrastructure. They are flaky because they depend on network availability and shared mutable state.
  • The UI test obsession. The team writes tests exclusively through the UI layer. Business logic that could be verified in milliseconds with a unit test is instead tested through a full browser automation flow that takes seconds per assertion.
  • The “we have coverage” illusion. Code coverage is high because the E2E tests exercise most code paths. But the tests are so slow and brittle that developers do not run them locally. They push code and wait 40 minutes to learn if it works. If a test fails, they assume it is flaky and rerun.

The telltale sign: developers do not trust the test suite. They push code and go get coffee. When tests fail, they rerun before investigating. When a test is red for days, nobody is alarmed.

Why This Is a Problem

An inverted test pyramid does not just slow the team down. It actively undermines every benefit that testing is supposed to provide.

The suite is too slow to give useful feedback

The purpose of a test suite is to tell developers whether their change works - fast enough that they can act on the feedback while they still have context. A suite that runs in seconds gives feedback during development. A suite that runs in minutes gives feedback before the developer moves on. A suite that runs in 30 or more minutes gives feedback after the developer has started something else entirely.

When the suite takes 40 minutes, developers do not run it locally. They push to CI and context-switch to a different task. When the result comes back, they have lost the mental model of the code they changed. Investigating a failure takes longer because they have to re-read their own code. Fixing the failure takes longer because they are now juggling two streams of work.

A well-structured suite - heavy on unit tests, light on E2E - runs in under 10 minutes. Developers run it locally before pushing. Failures are caught while the code is still fresh. The feedback loop is tight enough to support continuous integration.

Flaky tests destroy trust

End-to-end tests are inherently non-deterministic. They depend on network connectivity, shared test environments, external service availability, browser rendering timing, and dozens of other factors outside the developer’s control. A test that fails because a third-party API was slow for 200 milliseconds looks identical to a test that fails because the code is wrong.

When 10% of the suite fails randomly on any given run, developers learn to ignore failures. They rerun the pipeline, and if it passes the second time, they assume the first failure was noise. This behavior is rational given the incentives, but it is catastrophic for quality. Real failures hide behind the noise. A test that detects a genuine regression gets rerun and ignored alongside the flaky tests.

Unit tests and functional tests with test doubles are deterministic. They produce the same result every time. When a deterministic test fails, the developer knows with certainty that they broke something. There is no rerun. There is no “is that real?” The failure demands investigation.

Maintenance cost grows faster than value

End-to-end tests are expensive to write and expensive to maintain. A single E2E test typically involves:

  • Setting up test data across multiple services
  • Navigating through UI flows with waits and retries
  • Asserting on UI elements that change with every redesign
  • Handling timeouts, race conditions, and flaky selectors

When a feature changes, every E2E test that touches that feature must be updated. A redesign of the checkout page breaks 30 E2E tests even if the underlying behavior has not changed. The team spends more time maintaining E2E tests than writing new features.

Unit tests are cheap to write and cheap to maintain. They test behavior, not UI layout. A function that calculates a discount does not care whether the button is blue or green. When the discount logic changes, one or two unit tests need updating - not thirty browser flows.

It couples your pipeline to external systems

When most of your tests are end-to-end or integration tests that hit real services, your ability to deploy depends on every system in the chain being available and healthy. If the payment provider’s sandbox is down, your pipeline fails. If the shared staging database is slow, your tests time out. If another team deployed a breaking change to a shared service, your tests fail even though your code is correct.

This is the opposite of what CD requires. Continuous delivery demands that your team can deploy independently, at any time, regardless of the state of external systems. A test architecture built on E2E tests makes your deployment hostage to every dependency in your ecosystem.

A suite built on unit tests, functional tests, and contract tests runs entirely within your control. External dependencies are replaced with test doubles that are validated by contract tests. Your pipeline can tell you “this change is safe to deploy” even if every external system is offline.

Impact on continuous delivery

The inverted pyramid makes CD impossible in practice even if all the other pieces are in place. The pipeline takes too long to support frequent integration. Flaky failures erode trust in the automated quality gates. Developers bypass the tests or batch up changes to avoid the wait. The team gravitates toward manual verification before deploying because they do not trust the automated suite.

A team that deploys weekly with a 40-minute flaky suite cannot deploy daily without either fixing the test architecture or abandoning automated quality gates. Neither option is acceptable. Fixing the architecture is the only sustainable path.

How to Fix It

Inverting the pyramid does not mean deleting all your E2E tests and writing unit tests from scratch. It means shifting the balance deliberately over time so that most confidence comes from fast, deterministic tests and only a small amount comes from slow, non-deterministic ones.

Step 1: Audit your current test distribution (Week 1)

Count your tests by type and measure their characteristics:

Test type | Count | Total duration | Flaky? | Requires external systems?
Unit | ? | ? | ? | ?
Integration | ? | ? | ? | ?
Functional | ? | ? | ? | ?
E2E | ? | ? | ? | ?
Manual | ? | N/A | N/A | N/A

Run the full suite three times. Note which tests fail intermittently. Record the total duration. This is your baseline.

Step 2: Quarantine flaky tests immediately (Week 1)

Move every flaky test out of the pipeline-gating suite into a separate quarantine suite. This is not deleting them - it is removing them from the critical path so that real failures are visible.

For each quarantined test, decide:

  • Fix it if the behavior it tests is important and the flakiness has a solvable cause (timing dependency, shared state, test order dependency).
  • Replace it with a faster, deterministic test that covers the same behavior at a lower level.
  • Delete it if the behavior is already covered by other tests or is not worth the maintenance cost.

Target: zero flaky tests in the pipeline-gating suite by end of week.

Step 3: Push tests down the pyramid (Weeks 2-4)

For each E2E test in your suite, ask: “Can the behavior this test verifies be tested at a lower level?”

Most of the time, the answer is yes. An E2E test that verifies “user can apply a discount code” is actually testing three things:

  1. The discount calculation logic (testable with a unit test)
  2. The API endpoint that accepts the code (testable with a functional test)
  3. The UI flow for entering the code (testable with a component test)

Write the lower-level tests first. Once they exist and pass, the E2E test is redundant for gating purposes. Move it to a post-deployment smoke suite or delete it.

Work through your E2E suite systematically, starting with the slowest and most flaky tests. Each test you push down the pyramid makes the suite faster and more reliable.
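
Using the discount-code example above, the lower-level replacements might look like the sketch below. The names (pricing.apply_discount, the api_client fixture) are hypothetical; your framework's test client and routes will differ.

    # 1. Unit test for the calculation logic - milliseconds, fully deterministic.
    from pricing import apply_discount  # hypothetical module

    def test_ten_percent_code_reduces_total():
        assert apply_discount(total=100.00, code="SAVE10") == 90.00

    # 2. Functional test for the endpoint, using an in-process test client so no
    #    browser or deployed environment is involved.
    def test_discount_endpoint_accepts_valid_code(api_client):
        response = api_client.post("/cart/discount", json={"code": "SAVE10"})
        assert response.status_code == 200
        assert response.json()["total"] == 90.00

    # 3. The UI flow for entering the code gets a component test, and the checkout
    #    journey keeps one E2E smoke test - not one E2E test per discount rule.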

Step 4: Replace external dependencies with test doubles (Weeks 2-4)

Identify every test that calls a real external service and replace the dependency:

Dependency type | Test double approach
Database | In-memory database, testcontainers, or repository fakes
External HTTP API | HTTP stubs (WireMock, nock, MSW)
Message queue | In-memory fake or test spy
File storage | In-memory filesystem or temp directory
Third-party service | Stub that returns canned responses

Validate your test doubles with contract tests that run asynchronously. This ensures your doubles stay accurate without coupling your pipeline to external systems.
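
A sketch of the pattern for a third-party service, with hypothetical names: the production code depends on a small gateway interface, gating tests use a stub with canned responses, and a separate non-gating contract test periodically checks those canned responses against the provider's sandbox.

    class StubPaymentGateway:
        # Canned response mirroring the provider's documented success payload.
        def charge(self, amount_cents, card_token):
            return {"status": "approved", "amount": amount_cents}

    def checkout(cart_total_cents, card_token, gateway):
        receipt = gateway.charge(cart_total_cents, card_token)
        return receipt["status"] == "approved"

    def test_checkout_succeeds_when_charge_is_approved():
        assert checkout(1999, "tok_test", gateway=StubPaymentGateway())

The tools in the table (WireMock, nock, MSW, testcontainers) do the same job at the HTTP or infrastructure level; either way, the gating pipeline never waits on a system outside the team's control.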

Step 5: Adopt the test-for-every-change rule (Ongoing)

New code should be tested at the lowest possible level. Establish the team norm:

  • Every new function with logic gets a unit test.
  • Every new API endpoint or integration boundary gets a functional test.
  • E2E tests are only added for critical smoke paths - not for every feature.
  • Every bug fix includes a regression test at the lowest level that catches the bug.

Over time, this rule shifts the pyramid naturally. New code enters the codebase with the right test distribution even as the team works through the legacy E2E suite.

Step 6: Address the objections

Objection | Response
“Unit tests with mocks don’t test anything real” | They test logic, which is where most bugs live. A discount calculation that returns the wrong number is a real bug whether it is caught by a unit test or an E2E test. The unit test catches it in milliseconds. The E2E test catches it in minutes - if it is not flaky that day.
“E2E tests catch integration bugs that unit tests miss” | Functional tests with test doubles catch most integration bugs. Contract tests catch the rest. The small number of integration bugs that only E2E can find do not justify a suite of hundreds of slow, flaky E2E tests.
“We can’t delete E2E tests - they’re our safety net” | They are a safety net with holes. Flaky tests miss real failures. Slow tests delay feedback. Replace them with faster, deterministic tests that actually catch bugs reliably, then keep a small E2E smoke suite for post-deployment verification.
“Our code is too tightly coupled to unit test” | That is an architecture problem, not a testing problem. Start by writing tests for new code and refactoring existing code as you touch it. Use the Strangler Fig pattern - wrap untestable code in a testable layer.
“We don’t have time to rewrite the test suite” | You are already paying the cost of the inverted pyramid in slow feedback, flaky builds, and manual verification. The fix is incremental: push one test down the pyramid each day. After a month, the suite is measurably faster and more reliable.

Measuring Progress

Metric | What to look for
Test suite duration | Should decrease toward under 10 minutes
Flaky test count in gating suite | Should reach and stay at zero
Test distribution (unit : integration : E2E ratio) | Unit tests should be the largest category
Pipeline pass rate | Should increase as flaky tests are removed
Developers running tests locally | Should increase as the suite gets faster
External dependencies in gating tests | Should reach zero