A/B testing compares two or more variants of a system by randomly assigning users to groups and measuring statistically significant differences in predefined outcome metrics. In the ML context, A/B tests determine whether a new model, feature, or algorithm actually improves user outcomes compared to the current production version. Unlike offline evaluation, which measures proxy metrics, A/B tests capture real-world impact: user behavior changes, business KPIs, and long-term engagement effects.
A/B testing randomly splits incoming traffic between control (existing version) and treatment (new version) groups. A hash-based assignment function ensures users consistently see the same variant across sessions. The system logs all relevant metrics for both groups over a predetermined test duration.
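A minimal sketch of that assignment step, assuming SHA-256 as the stable hash; the experiment name, split ratio, and function name are illustrative rather than any particular library's API:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically bucket a user for one experiment."""
    # Hash the (experiment, user) pair so assignments are independent
    # across experiments but stable across sessions for the same user.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

# The same user always lands in the same group across sessions:
assert assign_variant("user-42", "ranker-v2") == assign_variant("user-42", "ranker-v2")
```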
Statistical analysis compares group metrics after sufficient data accumulates. Frequentist approaches compute p-values and confidence intervals to determine if observed differences exceed random variation. Bayesian approaches provide probability distributions over effect sizes, enabling nuanced decision-making beyond binary significance testing.
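Both styles can be sketched on simple conversion counts. The numbers below are made up and NumPy/SciPy are assumed available; the frequentist branch runs a two-proportion z-test, while the Bayesian branch samples Beta posteriors under a uniform prior:

```python
import numpy as np
from scipy import stats

# Hypothetical logged outcomes: (conversions, users) per group.
control = (480, 10_000)
treatment = (530, 10_000)

# --- Frequentist: two-sided two-proportion z-test ---
p1, p2 = control[0] / control[1], treatment[0] / treatment[1]
pooled = (control[0] + treatment[0]) / (control[1] + treatment[1])
se = np.sqrt(pooled * (1 - pooled) * (1 / control[1] + 1 / treatment[1]))
z = (p2 - p1) / se
p_value = 2 * stats.norm.sf(abs(z))

# --- Bayesian: Beta(1, 1) prior, compare posterior samples ---
rng = np.random.default_rng(0)
post_c = rng.beta(1 + control[0], 1 + control[1] - control[0], 100_000)
post_t = rng.beta(1 + treatment[0], 1 + treatment[1] - treatment[0], 100_000)
prob_treatment_better = (post_t > post_c).mean()

print(f"z={z:.2f}, p={p_value:.3f}, P(treatment > control)={prob_treatment_better:.2%}")
```

The Bayesian output answers the question decision-makers usually ask directly ("how likely is the new version to be better?") instead of requiring a p-value interpretation.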
Sample size calculations determine how long tests must run to detect meaningful effects. Smaller expected effects require larger samples. Sequential testing methods allow early stopping when results are conclusive, saving time while maintaining statistical validity.
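A back-of-the-envelope version of that calculation for two conversion rates, using the standard two-proportion normal approximation; the baseline rate and minimum detectable effect (MDE) below are assumptions:

```python
import math
from scipy import stats

def sample_size_per_group(p_base: float, mde: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users per group for a two-sided two-proportion z-test."""
    p_new = p_base + mde
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # critical value for significance
    z_beta = stats.norm.ppf(power)           # critical value for power
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# Detecting a 1-point lift over a 5% baseline needs roughly 8,000 users
# per group; halve the MDE and the requirement roughly quadruples.
print(sample_size_per_group(0.05, 0.01))
```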
Multi-armed bandit approaches dynamically shift traffic toward winning variants during the test, maximizing cumulative reward while still learning. This trades statistical purity for real-time optimization, which is appropriate when the cost of serving inferior variants is high.
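A toy Thompson sampling loop illustrates the idea. The click rates here are simulated and would be unknown in production, where rewards come from logged user outcomes:

```python
import numpy as np

rng = np.random.default_rng(1)
true_rates = {"control": 0.048, "treatment": 0.053}  # hidden in practice
successes = {arm: 0 for arm in true_rates}
failures = {arm: 0 for arm in true_rates}

for _ in range(50_000):
    # Sample a plausible rate for each arm from its Beta posterior,
    # then serve the arm with the highest sampled rate.
    draws = {arm: rng.beta(1 + successes[arm], 1 + failures[arm])
             for arm in true_rates}
    arm = max(draws, key=draws.get)
    reward = rng.random() < true_rates[arm]  # simulated user response
    successes[arm] += reward
    failures[arm] += 1 - reward

# Traffic concentrates on the better arm as evidence accumulates.
print({arm: successes[arm] + failures[arm] for arm in true_rates})
```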
Offline metrics are weak predictors of real-world performance: a model with better benchmark scores may still worsen user experience through added latency, unexpected behavior, or context mismatch. A/B testing provides ground truth about whether changes actually improve outcomes, preventing the deployment of changes that look good on paper but harm users in practice.
Aaron is an engineering leader, software architect, and founder with 18 years building distributed systems and cloud infrastructure. Now focused on LLM-powered platforms, agent orchestration, and production AI. He shares hands-on technical guides and framework comparisons at fp8.co.