A/B Testing, Product Updates — 22 Mar 2022

A/B/n Testing: Choose the Right Type of Experiment with SplitMetrics

Table of Contents

SplitMetrics Bayesian
Multi-armed bandit by SplitMetrics
Sequential A/B testing by SplitMetrics

Unleash the full potential of A/B testing with the advanced SplitMetrics experience: choose the core statistical method that matches your experiment goals.

SplitMetrics Optimize offers three methods within one platform.

No more gut feeling in running experiments. Growth, UA, and marketing managers can flexibly manage experiments and choose the most efficient A/B/n method, whether the goal is to boost conversion rates or to check concepts and hypotheses. SplitMetrics’ comprehensive solution and dedicated team of experts will help you choose the most efficient way to run your experiments.

Using a spade for some jobs and a shovel for others does not require you to sign up to a lifetime of using only Spadian or Shovelist philosophy or to believe that only spades or shovels represent the One True Path to garden neatness. There are different ways of tackling statistical problems, too.
– Ken Rice, Department of Biostatistics, University of Washington

Enhance your A/B testing with the Bayesian approach, an industry gold standard for iterative A/B testing and growth hacking that requires less traffic. Marketers and growth managers can check the effect daily and avoid overspending by finishing tests earlier, as soon as there is enough data for conclusions. The SplitMetrics platform offers early stopping by toc (threshold of caring), an expected loss tolerance: the test stops once a variation is clearly an overperformer, an underperformer, or approximately equal to the others, so you can spend less on a test.
Use Multi-armed bandit by SplitMetrics to test up to 8 variations and reduce the budget by automatically excluding poor performers. Bandit algorithms adjust in real time and quickly send more traffic to the better variation. This helps maximize conversions when there’s no time to gather statistically significant results.
Use the Sequential method to check global ideas and hypotheses and to run more complex experiments that require more traffic. Determine the exact significance level and analyze the performance of all variations after an experiment is finished. This algorithm gives you complete control over the experiment results: you can change the Significance Level and MDE, which affect both the traffic required to end the experiment and the probability of getting wrong results.

SplitMetrics Bayesian

Bayesian testing is a new, enhanced approach to experiments provided by SplitMetrics.

The Bayesian approach is helpful when marketers have beliefs or knowledge they can use as a starting assumption (in our case, the informed prior in the default settings). The algorithm combines this assumption with the observed data to estimate the probability of a specific outcome. Because beliefs or knowledge become part of the experiment, users can make faster decisions at a lower experiment cost than with the Frequentist or Sequential methods.

The benefit of the Bayesian approach in SplitMetrics is that users gain indirect control over risk, because they control the expected loss stopping rule (threshold of caring).

The stopping rule ends the test and reduces the budget once the winner or underperformer is obvious.

A Bayesian test starts from a weak prior assumption about the expected conversion (the prior distribution). The prior dampens the noise at the beginning of the test, when the sample size is small. As the test progresses and more visitors participate, the impact of the prior fades.
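
To make the fading effect concrete, here is a minimal sketch that assumes a Beta prior on the conversion rate. The prior parameters (Beta(2, 18), an expected conversion of 10%) and the function name are hypothetical illustrations, not the platform’s actual defaults.

```python
def posterior_mean(conversions, visitors, prior_alpha=2.0, prior_beta=18.0):
    """Posterior mean conversion rate under a Beta(prior_alpha, prior_beta) prior.

    The prior values are illustrative assumptions, not platform defaults.
    """
    return (prior_alpha + conversions) / (prior_alpha + prior_beta + visitors)


# Early in the test, the prior pulls a noisy 1-out-of-5 (20%) observation
# towards the prior mean of 10%.
early = posterior_mean(conversions=1, visitors=5)       # 3 / 25

# With 1,000 visitors, the same 20% observed rate dominates the prior.
late = posterior_mean(conversions=200, visitors=1000)   # 202 / 1020

print(f"early estimate: {early:.3f}, late estimate: {late:.3f}")
```

With five visitors the estimate sits much closer to the prior than to the observed rate; with a thousand visitors it is almost entirely driven by the data.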

Leverage Bayesian tests for growth hacking and iterative A/B testing if you need quick results with less traffic.

What the Bayesian cycle looks like

Test setup: utilizing assumptions based on prior test history
Warm-up stage (250 unique visitors per variation) + Knowledge update
Activation of stopping rule algorithm based on real data + Knowledge update
The winner is declared if the expected loss hits the threshold.
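
The cycle above can be sketched roughly as follows. This is not the SplitMetrics implementation: it assumes a uniform Beta(1, 1) prior instead of the informed prior, a hypothetical threshold of 0.001 (0.1 percentage points of conversion), and made-up function names. Only the 250-visitor warm-up comes from the description above.

```python
import random


def expected_losses(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of the expected loss of shipping A and of shipping B.

    Assumes a uniform Beta(1, 1) prior for both variations (an illustrative
    choice, not the platform's informed prior).
    """
    rng = random.Random(seed)
    loss_a = loss_b = 0.0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        loss_a += max(rate_b - rate_a, 0.0)  # conversion given up by shipping A
        loss_b += max(rate_a - rate_b, 0.0)  # conversion given up by shipping B
    return loss_a / draws, loss_b / draws


def bayesian_decision(conv_a, n_a, conv_b, n_b, threshold=0.001, warm_up=250):
    """Apply the warm-up and the expected-loss stopping rule (hypothetical threshold)."""
    if min(n_a, n_b) < warm_up:
        return "warm-up"
    loss_a, loss_b = expected_losses(conv_a, n_a, conv_b, n_b)
    if loss_a < threshold and loss_b < threshold:
        return "approximately equal"
    if loss_b < threshold:
        return "B wins"
    if loss_a < threshold:
        return "A wins"
    return "keep collecting data"


print(bayesian_decision(conv_a=50, n_a=1000, conv_b=90, n_b=1000))
```

The threshold is expressed in the same units as the conversion rate, which is what makes it a “threshold of caring”: it states how much conversion you are willing to risk losing by stopping now.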

Three criteria that must all be met to use Bayesian A/B testing

The constant flow of cheap-to-produce and hard-to-test ideas
The goal is to increase the conversion rate
The implementation scale is defined in scope and time.

If your growth-hacking model doesn’t meet all of these criteria, the Sequential method is the better choice.

When you should probably choose an alternative to the Bayesian approach

A single idea to test and a long release cycle: growth managers don’t expect new features to challenge and replace the tested ones in the foreseeable future. Therefore they wouldn’t want to stick with a valueless feature indefinitely.

There is a chance of getting stuck with a slight deterioration, which can accumulate into a significant effect over time.

Indefinite implementation scale: the test result becomes shared knowledge that can be applied at an uncontrollable scale across the company. When marketers and product managers check a hypothesis to change a product, they know exactly how the change will be implemented. In big companies and enterprises, however, a result can turn into standing knowledge that is reused without further changes or tests (for instance, “a green background always improves conversion rate”). With the Bayesian method, you can get stuck with a deterioration, and improvements scale unpredictably.

The possible flaws of the Bayesian approach in some contexts are side effects of treating control and test groups equally and of controlling for cost/value rather than for False Positive/False Negative error rates.

When a conservative strategy is rationally justified

Low-trust environment: if you are not running the tests yourself and your team’s decision-makers are incentivized to maximize the number of releases, they tend to take excessive risks.

If you want a strict rule limiting the team’s discretion, you should try the Sequential approach with a standard p-value threshold.

Risk aversion: rolling out a variant with close-to-zero value is unacceptable because your release cost is high.

Even if there is only a slight difference between variations, the Bayesian approach recommends the one that looks at least a little better at the moment. For global and high-cost-to-implement ideas, lean towards a conservative strategy with the Sequential approach, which ensures statistically significant data for each variation before you draw conclusions.

The main value of Bayesian A/B testing

Online cost optimization

Early stopping by toc (threshold of caring), an expected loss tolerance: the test stops once a variation is clearly an overperformer, an underperformer, or approximately equal to the others, so you can spend less budget on a test.

The Frequentist test comes from an academic setting, where a false positive result is a major failure by default: you mislead the entire scientific community by reporting one. Thus, you are conservative by default: you prefer sticking to the current state and sacrificing exploitation to keep the chance of being even marginally wrong low.

In most commercial applications, decision-makers are concerned not with the false positive rate per se but with continuous product improvement. It is okay to use less traffic and make minor mistakes as long as significant improvements more than compensate for them. The Bayesian test controls for the magnitude of a potential error rather than for the fraction of false-positive results.

Imagine running a series of ten experiments. Your current conversion is 20%. Three of the tested features increase the target conversion by 0.01%, another three decrease it by 0.01%, one is a killer feature with a +5% effect, and the last one is a bug that would cause a 10% conversion decrease. In a typical growth hacking setting, you don’t care much about minor improvements and losses. You need to make sure you have dumped the bug and released the killer feature. You can save a lot of traffic if you reformulate your task from the academia-inherited false-positive-intolerance concept to the industrial big-error-intolerance concept. The Bayesian test has an interface for this, which brings us to the next point.

Interpretability

Bayesian A/B testing in the SplitMetrics platform shows a winning probability, which makes experiments easier to interpret.

Bayesian delivers comprehensible business insights.
The output looks like this: “B is 87% likely to outperform A. You may either stop the test and gamble on this or spend more traffic to be more sure.”
In the Frequentist test paradigm, you would probably get just “p-value > 0.05, the result is insignificant”.
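
A statement such as “B is 87% likely to outperform A” can be produced by comparing draws from the two posterior distributions. The sketch below is an assumption-laden illustration, not the platform’s code: it uses a uniform Beta(1, 1) prior, a hypothetical function name, and invented visitor counts.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Estimate P(rate_B > rate_A) by sampling both Beta posteriors.

    Assumes a uniform Beta(1, 1) prior; counts are conversions and visitors.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Invented data: 100 of 1,000 visitors converted on A, 115 of 1,000 on B.
p = prob_b_beats_a(conv_a=100, n_a=1000, conv_b=115, n_b=1000)
print(f"B is {p:.0%} likely to outperform A.")
```

The same observed difference with ten times more visitors would push the probability much closer to 100%, which is the “spend more traffic to be more sure” option.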

Take advantage of Bayesian tests for growth hacking and iterative A/B testing if you need quick results with less traffic required. Contact our experts to start your experiments with SplitMetrics.

Multi-armed bandit by SplitMetrics

A multi-armed bandit approach allows users to dynamically allocate more traffic to variations that perform well while sending less traffic to underperforming variations in real time. This testing approach produces faster results since there’s no need to wait for a single winning variation.

This approach helps users run experiments with many variations effectively while saving costs.

What the multi-armed bandit testing cycle looks like

Exploration phase: try each arm N times
Exploitation phase: play the best arm in the remaining rounds, as determined by the chosen Strategy. SplitMetrics offers the Thompson Sampling algorithm.
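
The two phases can be simulated in a few lines. This is a sketch under stated assumptions, not the platform’s implementation: the conversion rates, the number of rounds, the exploration length, and the function name are all invented, and the posterior uses a uniform Beta(1, 1) prior. Note that Thompson Sampling keeps exploring to some degree during the second phase instead of committing to one arm outright.

```python
import random


def thompson_sampling(true_rates, rounds=5000, explore_per_arm=10, seed=0):
    """Simulate traffic allocation across variations with Thompson Sampling.

    true_rates are hypothetical conversion rates, unknown to the algorithm.
    Returns the number of visitors sent to each variation.
    """
    rng = random.Random(seed)
    arms = range(len(true_rates))
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)

    def show(arm):
        # Send one visitor to the arm and record whether they converted.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1

    # Exploration phase: try each arm a fixed number of times.
    for arm in arms:
        for _ in range(explore_per_arm):
            show(arm)

    # Adaptive phase: draw a plausible rate for every arm from its Beta
    # posterior and send the next visitor to the arm with the highest draw.
    for _ in range(rounds - explore_per_arm * len(true_rates)):
        draws = [rng.betavariate(1 + successes[a], 1 + failures[a]) for a in arms]
        show(max(arms, key=lambda a: draws[a]))

    return [successes[a] + failures[a] for a in arms]


visitors = thompson_sampling([0.02, 0.03, 0.10])
print(visitors)  # visitors per variation; the strongest one should dominate
```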

The multi-armed bandit differentiators

Marketers and product managers usually choose this type of A/B testing because of the smooth transition between exploration and exploitation, speed, and automation. It is more relevant for experiments with many variations. The SplitMetrics platform offers up to 8 variations per experiment. The automatic allocation of traffic minimizes the costs of experiments with no need to spend more budget for ineffective variations. By eliminating unnecessary and weak options, users simultaneously increase the likelihood of finding the most efficient option.

Bandit algorithms are relevant for short tests: they adjust in real time and quickly send more traffic to the better variation. This makes them applicable to seasonal promo campaigns and one-time promotions, where you need to choose the best option for budget allocation promptly.

Be careful: because the algorithm sends more traffic to higher-performing content, it is likely to reinforce slight differences in low-traffic experiments and skew the results. With less traffic, there is a risk of treating a non-optimal option as optimal: the less data, the less reliable the estimate.

The main value of Multi-armed bandit A/B/n testing

Online cost optimization

It gradually moves traffic towards winning variations instead of forcing users to wait for a “final answer” at the end of an experiment. Hence, it speeds up the experiment as samples that would have gone to inferior variations can be assigned to potential winners. The extra data collected on the high-performing variations can help quickly separate the “good” arms from the “best” ones.

Flexibility/Controls

Bandit tests are adaptive and continue to explore even while exploiting, which works well when there are many groups and variations. They throw out weak options (depending on the Strategy), so you spend less on traffic.

Guarantees

The SplitMetrics platform offers “the probability to be the best.”

Try the enhanced A/B/n testing experience with SplitMetrics.

Sequential A/B testing by SplitMetrics

The objective of sequential sampling is to reduce the sample size that must be drawn to reach a decision about a population within specified error limits.

“Sequential sampling allows the experimenter to stop the trial early if the treatment appears to be a winner; it therefore addresses the “peeking” problem associated with eager experimenters who use (abuse) traditional fixed-sample methods. The above procedure can, in some circumstances, reduce the number of observations required for a successful experiment by 50% or more. The procedure works extremely well with low conversion rates (with small Bernoulli parameters). It works less well with high conversion rates, but traditional fixed-sample methods should do you just fine in those cases.”
– Evan Miller

Sequential A/B testing cycle for testing hypotheses

Start your experiment by choosing the sample size (total conversions) N.
Randomly assign each visitor to the treatment or the control, with 50% probability each.
Track the number of incoming successes for both variations. Let’s refer to the number of conversions from the treatment variation as T and from the control as C.
Finish the test when T−C reaches 2√N and declare the treatment variation to be the winner of your A/B test.
Finish the test when T+C reaches N. In this case, declare that the experiment had no winner. Detailed calculations that support this workflow can be found in this article by Evan Miller.
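
The stopping rule in this cycle is simple enough to sketch directly. The code below follows the procedure as described in Evan Miller’s article, with T and C counted as conversions and the winning boundary at 2√N; the function name and the way conversions are fed in (a list of "T" and "C" labels in arrival order) are assumptions made for the example.

```python
import math


def sequential_test(conversions, n):
    """Run the simple sequential stopping rule over a stream of conversions.

    conversions: iterable of "T" (treatment) or "C" (control), one label per
                 conversion, in the order the conversions arrive.
    n:           the planned sample size N (total conversions).
    Returns (verdict, T, C).
    """
    boundary = 2 * math.sqrt(n)
    t = c = 0
    for label in conversions:
        if label == "T":
            t += 1
        elif label == "C":
            c += 1
        else:
            raise ValueError(f"unexpected label: {label!r}")
        if t - c >= boundary:
            return "treatment wins", t, c
        if t + c >= n:
            return "no winner", t, c
    return "keep running", t, c


# With N = 100 the boundary is 2 * sqrt(100) = 20 conversions of lead.
print(sequential_test(["T"] * 20, n=100))        # treatment reaches the boundary
print(sequential_test(["T", "C"] * 50, n=100))   # T + C reaches N first
```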

Sequential testing differentiators

Sequential test designs allow users to analyze data while the test runs to determine whether an early decision can be made. Done incorrectly, this is known as “peeking” and increases the risk of false-positive and false-negative errors. The method is applicable to experiments with lower conversions and relevant for experiments where you need to determine the exact significance level. You can analyze the performance of all variations after an experiment is finished.

Sequential A/B tests by SplitMetrics enable you to analyze how users behave on your app store product page, what elements draw their attention, and whether they tend to scroll through your screenshots and watch app previews before pressing the download button or leaving the page.

The method is relevant for testing global ideas at the pre-launch stage and is better suited to large experiments. It can reduce the chance of mistakes so that the result approximates the true effect as closely as possible.

The main value of Sequential testing

Online cost optimization

Early stopping works better with lower baseline conversions. With higher conversions, it needs more traffic than the classical Frequentist approach.

The calculations by Evan Miller demonstrate that:

The smaller the MDE, the more traffic sequential sampling saves. It’s possible to save 40% or even more at times.
There are cases when sequential sampling requires more conversions than classic A/B testing.

To check whether the Sequential sampling method suits you, do the following calculation:

1.5 × Baseline Conversion + Minimum Detectable Effect

If the result exceeds 36%, the classic approach to A/B testing will help you finish your experiment earlier. If the result is less than 36%, it makes sense to opt for sequential A/B testing, as it will help you get reliable results faster using less traffic. Given the relatively low conversion rates in popular app categories such as games or photo & video, sequential sampling is a fantastic opportunity to speed up results in mobile A/B testing.
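
As a quick way to apply this rule of thumb, here is a small helper. The function name is hypothetical, and both inputs are assumed to be fractions (0.05 for 5%); check how your MDE is defined, absolute or relative, before plugging it in, since the text above does not specify it.

```python
def prefers_sequential(baseline_conversion, mde, cutoff=0.36):
    """Return True when 1.5 * baseline + MDE falls below the 36% cutoff.

    Inputs are fractions, e.g. 0.05 for a 5% baseline conversion.
    """
    return 1.5 * baseline_conversion + mde < cutoff


# A 5% baseline with a 1% MDE: 1.5 * 0.05 + 0.01 = 0.085, well under 0.36.
print(prefers_sequential(0.05, 0.01))

# A 30% baseline with a 5% MDE: 1.5 * 0.30 + 0.05 = 0.50, above 0.36.
print(prefers_sequential(0.30, 0.05))
```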

Interpretability

Interpretation is more complex: either the option won, or we have no information. The result is challenging to translate into business metrics, because users don’t know how it will affect conversion rate growth.

Flexibility/Controls

You can finish the test automatically if the effect exceeds expectations (early stopping). You can influence the experiment, error level, and significance level and define the sample size.

Guarantees

Sequential testing guarantees bounds on the false positive and false negative probabilities (alpha and beta).

Start using Sequential testing to check your high-cost global ideas with the SplitMetrics platform and dedicated team of experts.

Lesia Polivod

Product Marketing Manager at SplitMetrics

Conveys product value to the customers