Flaky tests are automated tests that produce inconsistent results, passing and failing without any code changes. Research shows they affect up to 16% of tests in large codebases and consume 15-20% of developer debugging time, making them one of the most persistent problems in modern software engineering.
Table of Contents
- What Are Flaky Tests?
- Why Flaky Tests Matter
- 8 Root Causes of Flaky Tests
- Detection Strategies
- Step-by-Step Debugging Process
- Fixing Strategies
- Tools for Managing Flaky Tests
- Case Study: Reducing Flakiness by 85%
- Prevention Best Practices
- Flaky Test Management Checklist
- Frequently Asked Questions
What Are Flaky Tests?
A flaky test is an automated test that produces non-deterministic results. It passes on one execution and fails on the next, despite running against the same codebase with no changes to the code, configuration, or test inputs. This inconsistency makes it impossible to tell whether a failure represents a genuine bug or just noise.
Consider a test that validates a user registration workflow. On Monday morning, it passes across all environments. By Tuesday afternoon, the same test fails in CI but passes when a developer reruns it locally. No one changed the code. No one touched the test. Yet the result changed. That is flakiness in action.
Flaky tests exist on a spectrum. Some fail once in every hundred runs, making them nearly invisible until they accumulate. Others fail in 30-40% of executions, creating constant disruption. The subtle ones are the most dangerous: once a team assumes every failure might be a false alarm, it learns to ignore all test failures, including the genuine ones.
The scope of this problem is significant. Studies from Google's engineering teams found that roughly 1.5% of all test runs produced flaky results, and approximately 16% of their test cases exhibited some degree of flakiness over time. At that scale, flakiness becomes a systemic issue rather than a minor annoyance.
Why Flaky Tests Matter
Flaky tests are not just a technical inconvenience. They create measurable damage across engineering organizations in several critical ways.
Eroded trust in the test suite. When developers see intermittent failures regularly, they begin to dismiss all failures as noise. This behavior means real bugs slip through because genuine failures get lost among flaky ones. Teams that tolerate flakiness eventually stop treating test results as reliable signals, undermining the entire purpose of automated testing.
Wasted developer time. Engineering teams report spending 15-20% of their debugging time investigating flaky test failures. Developers context-switch away from feature work to investigate a failure, only to discover the test passed on rerun. Multiply this across hundreds of developers and thousands of tests and the cost becomes substantial.
CI/CD pipeline degradation. Flaky tests increase build times because teams implement retry mechanisms to compensate. A pipeline that should take 15 minutes now takes 30 because multiple test stages need reruns. This directly impacts deployment velocity and the team's ability to deliver reliably, a core principle of the shift-left approach.
Masked real defects. Perhaps the most dangerous consequence is that flaky tests hide genuine bugs. When a team routinely reruns failed tests, they create a culture where real failures get the same treatment as flaky ones. A legitimate regression that would have been caught immediately now takes days or weeks to surface.
Reduced deployment confidence. Organizations with high flaky test rates often develop a fear of deploying. If the green build might be a false positive and the red build might be a false negative, teams lose the confidence they need to ship frequently.
8 Root Causes of Flaky Tests
Understanding why tests become flaky is the first step toward eliminating them. Research across large-scale test suites reveals eight primary causes, each contributing a different proportion of overall flakiness.
1. Timing Dependencies and Race Conditions (35%)
The single largest source of flakiness comes from assumptions about timing. Tests that use hard-coded sleep statements, rely on specific execution order of asynchronous operations, or assume a network call will complete within a fixed window are inherently non-deterministic. A test that sleeps for two seconds waiting for an API response works perfectly on a fast machine but fails on a loaded CI server where the same call takes three seconds.
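A minimal Python sketch of this race (the delays are illustrative, not real measurements — 0.01s stands in for a fast machine, 0.2s for a loaded CI server):

```python
import threading
import time

result = {"ready": False}

def start_async_operation(delay):
    """Simulate an async call that completes after `delay` seconds."""
    def work():
        time.sleep(delay)
        result["ready"] = True
    threading.Thread(target=work).start()

def sleep_based_check(op_delay):
    """The flaky pattern: a fixed sleep assumes the operation is fast enough."""
    result["ready"] = False
    start_async_operation(op_delay)
    time.sleep(0.05)          # hard-coded wait baked into the test
    return result["ready"]

fast = sleep_based_check(0.01)   # fast machine: operation beats the sleep
slow = sleep_based_check(0.20)   # loaded CI server: same code, sleep expires first
print(fast, slow)                # True False
```

The test logic never changed; only the machine's load did. That is why sleep-based synchronization is inherently non-deterministic.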
2. Test Order Dependencies (20%)
Tests that depend on other tests running first, or on a specific execution sequence, break when test runners parallelize or randomize order. Test A creates a database record, test B reads it. When run sequentially, everything works. When run in parallel or reversed, test B fails because the record does not exist yet.
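This dependency can be sketched in a few lines (the in-memory `db` dict stands in for a real database, and the tiny harness stands in for a test runner):

```python
db = {}

def test_create_record():
    db["record-1"] = {"value": 42}
    assert "record-1" in db

def test_read_record():
    # Hidden dependency: only passes if test_create_record ran first.
    assert db.get("record-1") is not None

def run_suite(tests):
    """Tiny harness: run tests against a fresh db, record pass/fail."""
    db.clear()
    outcomes = []
    for test in tests:
        try:
            test()
            outcomes.append("pass")
        except AssertionError:
            outcomes.append("fail")
    return outcomes

print(run_suite([test_create_record, test_read_record]))  # ['pass', 'pass']
print(run_suite([test_read_record, test_create_record]))  # ['fail', 'pass']
```

The same two tests pass or fail depending purely on execution order, which is exactly what randomized or parallel runners expose.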
3. Shared Mutable State (15%)
When tests modify shared resources like databases, in-memory caches, global variables, or file systems without proper isolation, one test's side effects contaminate another. This is especially common in integration tests where a shared database is used across multiple test classes without transaction rollback or data cleanup.
4. Environmental Differences (10%)
Tests that pass on a developer's machine but fail in CI often suffer from environmental dependencies. Differences in OS versions, timezone settings, locale configurations, available memory, CPU speed, or installed system libraries can all cause tests to produce different results. Proper test environment management is essential for eliminating this category.
5. Network and External Service Dependencies (10%)
Tests that make real HTTP calls to external APIs, third-party services, or even internal microservices introduce network variability. DNS resolution delays, service rate limiting, temporary outages, or changed API response structures all create flakiness that has nothing to do with the code under test.
6. Test Data Issues (5%)
Tests that rely on pre-existing data in a shared database, hardcoded IDs that may not exist, or data created by other test runs are vulnerable to data drift. When the expected data changes or disappears, the test fails despite the application logic being correct.
7. Resource Leaks (3%)
Tests that fail to properly close database connections, file handles, network sockets, or browser instances cause resource exhaustion over the course of a test suite run. Early tests pass, but later tests fail because the system has run out of available connections or memory.
8. Platform-Specific Behavior (2%)
Subtle differences in how operating systems handle file paths, line endings, floating-point arithmetic, or character encoding can cause tests to pass on one platform and fail on another. This is particularly common in cross-platform projects that run CI on Linux but have developers working on macOS or Windows.
Detection Strategies
You cannot fix what you cannot measure. Effective flaky test management starts with systematic detection.
Run-and-rerun analysis. Execute your full test suite multiple times (three to five runs) against the same commit. Any test that produces different results across runs is definitively flaky. This is the most reliable detection method but also the most resource-intensive.
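The comparison logic behind run-and-rerun analysis is straightforward. A sketch, assuming each suite run is collected as a name-to-outcome map (the test names here are hypothetical):

```python
def find_flaky_tests(runs):
    """runs: list of {test_name: "pass" | "fail"} dicts, one per suite run
    against the same commit. A test is flaky if its outcome varies."""
    outcomes = {}
    for run in runs:
        for name, result in run.items():
            outcomes.setdefault(name, set()).add(result)
    return sorted(name for name, seen in outcomes.items() if len(seen) > 1)

runs = [
    {"test_login": "pass", "test_checkout": "pass", "test_search": "fail"},
    {"test_login": "pass", "test_checkout": "fail", "test_search": "fail"},
    {"test_login": "pass", "test_checkout": "pass", "test_search": "fail"},
]
print(find_flaky_tests(runs))  # ['test_checkout']
```

Note that `test_search` fails in every run, so it is flagged as a genuine failure rather than a flaky one — only inconsistent results count as flakiness.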
Automatic retry tracking. Configure your CI system to automatically retry failed tests once. Any test that fails and then passes on retry gets tagged as potentially flaky. Track retry rates per test over time to build a flakiness profile.
Commit-correlated analysis. When a test fails, check whether the commit that triggered the build modified any code related to that test. If the test fails on a commit that only changed documentation or an unrelated module, it is likely flaky.
Historical trend monitoring. Track each test's pass/fail ratio over a rolling window of 30 to 90 days. Tests with pass rates between 50% and 99% warrant investigation. A test with a 97% pass rate might seem fine, but in a suite of 5,000 tests, that means roughly 150 flaky failures per full run.
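The arithmetic behind that estimate, assuming each test fails independently at its observed rate:

```python
def expected_flaky_failures(suite_size, pass_rate):
    """Expected number of spurious failures in one full suite run,
    assuming each test independently fails at (1 - pass_rate)."""
    return suite_size * (1 - pass_rate)

# 5,000 tests at a 97% pass rate:
print(round(expected_flaky_failures(5000, 0.97)))  # 150
```

A per-test pass rate that looks healthy in isolation still produces a steady stream of red builds at suite scale.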
Quarantine automation. Implement automated quarantining where tests that fail more than a configured threshold (for example, two failures without code changes within a week) are automatically moved to a quarantine suite and flagged for investigation.
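The thresholding logic can be sketched as follows, using the example policy above (two no-code-change failures within a rolling week — the exact numbers are policy choices, not fixed rules):

```python
from datetime import datetime, timedelta

def should_quarantine(failures, now, max_failures=2, window=timedelta(days=7)):
    """failures: list of (timestamp, code_changed) tuples for one test.
    Quarantine when failures not tied to code changes reach the threshold
    inside the rolling window."""
    recent = [ts for ts, code_changed in failures
              if not code_changed and now - ts <= window]
    return len(recent) >= max_failures

now = datetime(2024, 5, 10)
history = [
    (datetime(2024, 5, 4), False),   # failed, no related code change
    (datetime(2024, 5, 9), False),   # failed again within the week
    (datetime(2024, 4, 1), False),   # outside the window — ignored
]
print(should_quarantine(history, now))  # True
```

Failures that coincide with relevant code changes are excluded, since those are likely genuine regressions rather than flakiness.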
Step-by-Step Debugging Process
When you identify a flaky test, follow this structured debugging process to find and eliminate the root cause.
Step 1: Reproduce the failure. Run the test in isolation at least 20 times to confirm flakiness. If it passes consistently in isolation, run it alongside the full suite to determine if the failure depends on other tests. Record the failure rate because it indicates severity and guides prioritization.
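A simple harness for this step might look like the following sketch; the counter-based `intermittent_test` is a stand-in for a real flaky test:

```python
def failure_rate(test_fn, runs=20):
    """Run a test repeatedly in isolation and return its failure rate."""
    failures = 0
    for _ in range(runs):
        try:
            test_fn()
        except AssertionError:
            failures += 1
    return failures / runs

calls = {"n": 0}

def intermittent_test():
    """Stand-in for a flaky test: fails on every third invocation."""
    calls["n"] += 1
    assert calls["n"] % 3 != 0

print(failure_rate(intermittent_test, runs=20))  # 0.3
```

A 30% failure rate demands immediate attention; a 1% rate may justify quarantining the test while feature work continues.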
Step 2: Analyze the failure pattern. Examine the failure messages and stack traces. Look for patterns. Does the failure always occur at the same assertion? Does the error message change between failures? Timing-related flakiness often produces different failure points, while state-dependent flakiness typically fails at the same assertion.
Step 3: Check for timing issues. Search the test for hard-coded waits, sleep statements, or fixed timeouts. Look for asynchronous operations that lack proper synchronization. Review any polling mechanisms for adequate timeout values. In UI tests, check for missing explicit waits on element visibility or interactability.
Step 4: Examine state management. Verify that the test creates all data it needs in its setup phase and cleans up in teardown. Check for shared global state, static variables, or singleton objects that persist between tests. Look for database operations that lack transaction isolation.
Step 5: Review environmental assumptions. Check for hardcoded file paths, timezone assumptions, locale-dependent string formatting, or platform-specific behavior. Look for dependencies on specific port numbers, available disk space, or network connectivity.
Step 6: Isolate external dependencies. Identify all external calls (HTTP requests, database queries, file system operations, message queue interactions). Determine which of these can be replaced with mocks, stubs, or fakes to remove variability.
Step 7: Implement the fix and validate. Apply the targeted fix based on your diagnosis. Run the test at least 50 times to confirm stability. Monitor the test for two weeks after the fix to catch any remaining intermittent issues.
Fixing Strategies
Once you have identified the root cause, apply the appropriate fix. The following workflow guides the resolution process for each category of flakiness.
Replacing Timing Dependencies
Remove all hard-coded sleep statements and replace them with explicit wait conditions. Instead of sleeping for five seconds, wait until the specific condition is met, such as an element becoming visible, an API response arriving, or a database record appearing. Set a maximum timeout to prevent infinite waits, but let the test proceed as soon as the condition is satisfied. This approach is both faster and more reliable.
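A generic polling wait captures this pattern; the helper below is a sketch (timeout and interval values are illustrative defaults):

```python
import threading
import time

def wait_for(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns truthy or `timeout` seconds elapse.
    Returns as soon as the condition holds instead of sleeping a fixed time."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

# Usage: wait on a background operation instead of sleeping five seconds.
done = threading.Event()
threading.Thread(target=lambda: (time.sleep(0.2), done.set())).start()
print(wait_for(done.is_set, timeout=2.0))  # True — returns after ~0.2s
```

The test finishes as soon as the condition is met on fast machines, yet tolerates slow CI servers up to the timeout — the best of both worlds.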
Isolating Test State
Each test should create the exact data it needs during setup and remove it during teardown. Use database transactions that roll back after each test, or create uniquely named resources (using UUIDs or timestamps) that cannot collide with other tests. Never depend on data created by a previous test or a shared seed script that other tests might modify.
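The UUID-naming and guaranteed-teardown pattern can be sketched with a context manager (the in-memory `db` dict is a stand-in for a real datastore):

```python
import uuid
from contextlib import contextmanager

@contextmanager
def isolated_record(db):
    """Create a uniquely named record for one test and remove it afterward.
    UUID-based keys cannot collide with records made by parallel tests."""
    key = f"user-{uuid.uuid4().hex}"
    db[key] = {"name": "test-user"}
    try:
        yield key
    finally:
        db.pop(key, None)   # teardown runs even if the test body fails

db = {}
with isolated_record(db) as key:
    assert db[key]["name"] == "test-user"
print(key in db)  # False — the record was cleaned up on exit
```

Most test frameworks offer an equivalent mechanism (pytest fixtures, JUnit `@AfterEach`); the essential properties are unique naming and teardown that runs unconditionally.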
Mocking External Dependencies
Replace real HTTP calls to external services with mock responses. Use tools like WireMock, MSW (Mock Service Worker), or your framework's built-in mocking to return predictable responses. For tests that must verify real integration, isolate those into a separate integration test suite with appropriate timeout handling and retry logic, following the principles outlined in our test automation guide.
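The key enabler is injecting the HTTP client at the service boundary so a test can substitute a mock. A sketch using Python's standard-library `unittest.mock` (the service class and endpoint path are hypothetical):

```python
from unittest.mock import Mock

class ProfileService:
    """Depends on an injected HTTP client rather than a global session,
    so tests can swap in a mock at the service boundary."""
    def __init__(self, http_client):
        self.http = http_client

    def display_name(self, user_id):
        payload = self.http.get(f"/users/{user_id}")
        return payload["name"].title()

# In the test, the real client is replaced with a predictable mock.
mock_http = Mock()
mock_http.get.return_value = {"name": "ada lovelace"}
service = ProfileService(mock_http)

print(service.display_name(7))           # Ada Lovelace
mock_http.get.assert_called_once_with("/users/7")
```

The test now exercises the service's own logic (here, title-casing the name) with zero network variability, and it also verifies that the correct endpoint was called.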
Containerizing Test Environments
Run tests inside Docker containers or similar isolated environments to eliminate platform and configuration differences. Define the exact OS, runtime version, timezone, locale, and system dependencies in a container image. This ensures every test execution occurs in an identical environment regardless of where it runs.
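A Docker Compose fragment for a pinned test environment might look like the following sketch — the service names, image tags, and environment values are illustrative, not a prescribed configuration:

```yaml
# docker-compose.test.yml — illustrative sketch
services:
  tests:
    build: .
    environment:
      TZ: UTC             # pin the timezone
      LANG: en_US.UTF-8   # pin the locale
    depends_on:
      - db
  db:
    image: postgres:16.2  # pin the exact version used in production
```

Pinning exact image tags (rather than `latest`) is what makes the environment reproducible across developer machines and CI.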
Eliminating Order Dependencies
Configure your test runner to randomize test execution order. This immediately surfaces any hidden order dependencies. Fix each dependency by ensuring tests set up their own preconditions rather than relying on side effects from other tests. In frameworks like Jest, use the --randomize flag. In pytest, the pytest-randomly plugin accomplishes the same goal.
Tools for Managing Flaky Tests
Several specialized tools help teams detect, track, and resolve flaky tests at scale.
| Tool | Type | Key Capability | Best For |
|---|---|---|---|
| BuildPulse | SaaS | Automatic flaky test detection with CI integration | Teams using GitHub Actions, CircleCI, or Jenkins |
| Launchable | SaaS | ML-based test selection and flakiness prediction | Large test suites needing intelligent prioritization |
| Flaky Test Handler | Open source | JUnit plugin for retry and quarantine automation | Java/Kotlin projects with JUnit 5 |
| pytest-rerunfailures | Open source | Automatic rerun of failed tests with configurable count | Python projects using pytest |
| Cypress Retry | Built-in | Native test retries with screenshots captured on failure | Frontend teams using Cypress |
| Playwright Test | Built-in | Auto-retries with trace recording on failure | E2E testing with detailed failure diagnostics |
| Trunk Flaky Tests | SaaS | Detects flaky tests and manages quarantine workflows | Organizations tracking flakiness across repositories |
| TSL Platform | Platform | Integrated test stability monitoring and analytics | Enterprise teams needing end-to-end visibility |
Case Study: Reducing Flakiness by 85%
A mid-size fintech company running 4,200 automated tests across their microservices architecture was experiencing a flaky test rate of 14%. Their CI pipeline required an average of 2.3 reruns per build, adding 25 minutes to every deployment cycle. Developers reported spending approximately four hours per week investigating false failures.
The team implemented a structured remediation program over eight weeks. In the first two weeks, they instrumented their CI system with retry tracking and historical trend analysis, which identified 588 tests exhibiting flakiness. They categorized each flaky test by root cause using the framework described above.
The analysis revealed that 42% of their flaky tests stemmed from timing issues in API integration tests that used hard-coded sleep statements. Another 28% came from shared database state between tests running in parallel. The remaining 30% split across environment differences, test order dependencies, and external service variability.
During weeks three through six, the team systematically addressed each category. They replaced all sleep statements with polling-based waits using exponential backoff. They wrapped each test in a database transaction that rolled back on completion. They containerized their test environments using Docker Compose configurations that mirrored production. For external API calls, they introduced WireMock stubs with recorded response fixtures.
By week eight, their flaky test rate dropped from 14% to 2.1%. Average pipeline time decreased from 38 minutes to 16 minutes. Developer time spent on false failure investigation dropped by 85%. Most importantly, the team's trust in their test suite was restored, and they began catching genuine regressions faster because real failures were no longer buried in noise.
This type of structured approach to test automation reliability applies across organizations of any size.
Prevention Best Practices
Preventing flaky tests is more effective than fixing them after they appear. Incorporate these practices into your development workflow to reduce flakiness at the source.
Design tests for isolation from the start. Every test should be completely self-contained. It creates its own data, performs its assertions, and cleans up after itself. Treat test isolation as a non-negotiable architectural requirement, not an afterthought.
Use explicit waits everywhere. Never use fixed sleep statements in tests. Always wait for a specific condition with a maximum timeout. This applies to UI tests waiting for elements, API tests waiting for responses, and integration tests waiting for asynchronous processes to complete.
Mock at the boundary. For unit tests and most integration tests, mock external dependencies at the service boundary. Reserve real external calls for a dedicated end-to-end suite that runs less frequently and has appropriate tolerance for environmental variability.
Run tests in randomized order. Make randomized test ordering the default in your CI configuration. Fix any tests that break under randomization immediately. This prevents order dependencies from accumulating.
Containerize your test environments. Define test environments as code using Docker or similar tools. Pin all dependency versions. Run CI tests in the same container images every time to eliminate environmental drift.
Enforce test cleanup in code review. Make proper setup and teardown a code review criterion. Reject tests that use shared state without isolation mechanisms. Treat missing cleanup as a defect, not a style preference.
Monitor flakiness continuously. Track flaky test rates as a team metric alongside code coverage and build time. Set thresholds (for example, a maximum of 2% flaky test rate) and treat breaches as production-level incidents that require immediate attention. For teams that need centralized visibility into test stability trends across repositories and pipelines, TotalShiftLeft.ai provides integrated analytics that surface flaky patterns before they erode CI/CD confidence.
Set time limits on quarantined tests. When you quarantine a flaky test, assign it an owner and a two-week deadline. If it is not fixed within the deadline, the team must decide whether to rewrite or remove it. Quarantine without accountability becomes a graveyard where flaky tests accumulate indefinitely.
Flaky Test Management Checklist
Use this checklist to evaluate and improve your team's flaky test management process:
- CI pipeline tracks retry rates per test and flags tests exceeding a threshold
- Historical flakiness data is available for every test in the suite
- Flaky tests are automatically quarantined when they exceed the failure threshold
- Quarantined tests have assigned owners and fix deadlines
- Test execution order is randomized in CI
- All tests create their own data and clean up after execution
- No hard-coded sleep statements exist in the test codebase
- External service calls are mocked in unit and integration tests
- Test environments are containerized and version-pinned
- Flaky test rate is tracked as a team-level metric with a defined target
- New tests are reviewed for isolation and determinism during code review
- Fixed flaky tests are validated with 50 or more consecutive passing runs
Frequently Asked Questions
What are flaky tests?
Flaky tests are automated tests that produce non-deterministic results, sometimes passing and sometimes failing when run against the same code without any changes. Research from Google found that approximately 16% of tests in large-scale systems exhibit some degree of flakiness. They cost significant developer time in investigation and reruns, and they undermine confidence in the entire test suite.
What causes flaky tests?
The most common causes are timing dependencies and race conditions, which account for roughly 35% of cases. Test order dependencies contribute 20%, shared mutable state between tests accounts for 15%, and environmental differences, network dependencies, and test data issues each contribute 5-10%. Most flakiness stems from non-deterministic behavior in test setup, teardown, or external dependencies rather than from the application code itself.
How do you detect flaky tests?
The most reliable method is running your test suite multiple times against the same commit and flagging any test that produces inconsistent results. Supplement this with automatic retry tracking in your CI system, where tests that need retries are tagged for investigation. Use specialized tools like BuildPulse or Trunk Flaky Tests to automate detection and track reliability scores over time. Quarantine tests that fail more than once without corresponding code changes.
How do you fix flaky tests?
Start by reproducing and categorizing the root cause. For timing issues, replace hard-coded waits with explicit wait conditions. For state issues, isolate test data so each test creates and cleans up its own data. For external dependency issues, mock network calls and third-party services. For environment issues, containerize your test execution environment. After applying a fix, validate stability by running the test at least 50 times consecutively.
Should you delete or quarantine flaky tests?
Quarantine first, then fix or delete within a defined timeframe. Move the flaky test to a separate quarantine suite that runs independently from the main CI/CD pipeline so it does not block deployments. Assign an owner and set a two-week deadline for resolution. If the test cannot be fixed within that window, evaluate the underlying scenario. If the scenario is important to verify, rewrite the test from scratch with proper isolation. If the scenario provides minimal value, delete the test rather than letting it persist as permanent noise in your quarantine.
Conclusion
Flaky tests represent one of the most persistent and costly problems in modern software testing. They waste developer time, degrade CI/CD reliability, erode trust in test suites, and allow real defects to escape detection. The good news is that flakiness is solvable with a structured approach.
Start by measuring your current flaky test rate and categorizing failures by root cause. Apply targeted fixes for each category: explicit waits for timing issues, data isolation for state issues, mocking for external dependencies, and containerization for environment issues. Build prevention into your development process through code review standards, randomized test ordering, and continuous flakiness monitoring.
The organizations that treat test reliability as a first-class engineering concern consistently outperform those that tolerate flakiness. By investing in flaky test elimination now, you build the foundation for faster, more confident deployments and a test suite that your entire team trusts.
Continue Learning
Explore more in-depth technical guides, case studies, and expert insights on our product blog:
- Best API Test Automation Tools Compared
- How to Build a Test Automation Framework
- No-Code API Test Automation Platforms
Browse All Articles on Total Shift Left Blog — Your go-to resource for shift-left testing, API automation, CI/CD integration, and quality engineering best practices.