Advancing Mobile A/B Testing with Bayesian Multi-Armed Bandit

In the course of an A/B experiment, the correct calculation of a sample size is one of the key success ingredients. Yet, sometimes the amounts of traffic necessary for statistical significance of tests put app publishers off. Indeed, a required sample size can be large that means a test lasts longer than you’d like.

However, this obstruction is not that dramatic if you run your A/B tests with help of SplitMetrics. The thing is the platform can apply an alternative approach called Bayesian Multi-armed Bandit (MAB), which can solve the above-mentioned drawback without even bothering you.

A Bayesian Multi-armed Bandit test allows choosing an optimal variation of the two or more. Unlike a classic A/B test, which is based on statistical hypotheses testing, a Bayesian MAB test proceeds from Bayesian statistics. In this post, we’ll learn more about the principles behind it.

Why “Multi-Armed Bandit”?

Our problem of choosing an optimal variation can be compared with so-called “Multi-armed bandit problem”.

Imagine, you are in a casino. There are many different slot machines, so-called ‘one-armed bandits’, as they are known for robbing you. Let’s think that some slot machines payout more frequently than others.

If you knew beforehand which machine had the highest payoff, you would play that one exclusively to maximize your expected return from playing. The difficulty is that you do not know beforehand which machine is the best. The resources are limited – if you pull one lever (arm), then you are not pulling another arm.

The goal is to walk out of the casino with as much money as possible. The problem is, how do you learn which slot machine is the best and get the most money in the shortest amount of time?

The straightforward analogy is to imagine different app store product page variations as a row of slot machines, each with its own probability of producing a conversion.

Difference between A/B/n and Bayesian Multi-Armed Bandit Tests

In a traditional A/B/n test, we determine a sample size and run the test until each variation gets a determined number of visitors. When our test is over, we use its results to choose a winning variation and upload it to the store. So in the course of our tests, we learn or explore and then – earn or exploit.

screenshots order tests with SplitMetrics

Let’s consider that MSQRD decided to check if a changed order of screenshots (Variation B) favors better conversion rate. If conversion rate of the variation A (CR(A)) is 20% and the expected conversion rate of variation B (CR(B)) is 26%, then the sample size for an A/B test to check the statistical significance at the significance level of 5% and with 80% statistical power is 608 visitors per variation.

You can check out all calculations behind it in our post on determining sample size for A/B testing.

If the conversion rate of variation B is really 26%, we’ll get 279 conversions during the test with 608 visitors per variation:

1216 * 0.5 * (CR(A) + CR(B)) = 608 * (0.2 + 0.26) = 279.68.

If all 1216 visitors (608 * 2) were directed to the best variation B, then it would be 316 conversions (1216 *0.26 = 316.16).

The difference between the actual conversions in the course of your experiment and the conversions you would get had you send every visitor to the optimal variation is called a regret.

In our A/B test example, the regret is.

The main idea behind a Multi-Armed Bandit test is to balance between:

  • using the current best guess about the probability of each variation being optimal
  • collecting more data to improve upon these guesses.

This tradeoff between using currently available information and collecting more data is often referred to as explore/exploit problem or earning while learning.

There are many algorithms for Multi-armed Bandit tests, and they all aim at balancing exploration with exploitation.

In this post, we consider Bayesian Multi-armed Bandit (another name for this algorithm is Thompson Sampling or Randomized Probability Matching). In this algorithm, a variation that appears to perform better (give more conversions) gets more visitors, and a variation with worse performance gets fewer visitors.

Imagine, we will perform a Bayesian Multi-armed Bandit test to choose between variations A and B for MSQRD. Suppose our Bayesian MAB test will last as long as the A/B test considered above but the variation B will get 80% and A – 20% of visitors. We will get 301 conversions as a result of the test with 1216 visitors:

1216 * (0.2 * CR(A) + 0.8 * CR(B)) = 1216 * (0.2 * 0.2 + 0.8 * 0.26) = 301.568.

In our example of Bayesian MAB test, the regret is 15 conversions and it is 2.5 times less than in the A/B test.

Thus, if we treat tests aimed at choosing an optimal variation as multi-armed bandits,  the cost of testing can be reduced dramatically.

In the post “Multi-armed bandit experiments”, the authors describe an experiment to show that Bayesian MAB requires a smaller sample than an A/B test, i.e. shorter duration of testing.

They compared A/B and Bayesian Multi-Armed Bandit tests to choose between variations A and B with expected conversion rates 4% and 5% respectively. The sample size, calculated for the A/B test with 95% confidence level, was 11165 visitors per variation.

In case of 100 visitors per day, the test would take 223 days. They repeated Bayesian MAB test 500 times and got the following results. The correct variation was found in 96.4% of tests (482 of 500), which is about the same error rate as in an A/B test.

There were a few tests where Bayesian Multi-Armed Bandit actually took longer than an A/B test, but only in about 1% of the cases (5 out of 500). The average time saved relative to A/B tests was 171.8 days, and the average number of conversions saved was 98.7.

Bayesian Multi-Armed Bandit Algorithm

The math behind the multi-armed bandit problem is so hard that in practice different approximate heuristic solutions are used. Bayesian MAB is such a heuristic and rather a popular one (e.g. it’s the choice of Google Analytics).

The detailed rationale and description of Bayesian Multi-Armed Bandit technique is given in the article “A modern Bayesian look at the multi-armed bandit”. However, let’s review the main aspects of the algorithm behind Bayesian MAB.

Bayes’ theorem states that for two events Y and Z, the conditional probability of Y given Z is the conditional probability of Z given Y scaled by the relative probability of Y compared to Z: PZ=PY*P(Y)P(Z)

How is Bayes’ theorem applicable to choosing an optimal variation in a test?

Let y = (y1,…,yn) denote the sequence of conversions observed in a test for some variation after n visitors. yi takes the value 1 under the condition of conversion for visitor i and 0 otherwise.

Let θ denote the unknown probability of conversion (a conversion rate) for a variation. Then, a probability of conversion for a variation based on the test results for n visitors (posterior) P(y) can be calculated using the Bayes’ theorem:


  • P(y|θ) – a likelihood of test results y in case of θ;
  • P(θ) – beliefs about θ before testing (prior);
  • P(y) – a constant (as it does not contain θ) and can be ignored.

Let denote µ(θ) a conversion rate for a variation observed in the test. Suppose we test k variations. Then a probability for the variation j to be optimal ωj = P(µj = max{ µ1 ,…, µk}) can be calculated from the formula mentioned above (1). Bayesian MAB allocates visitors to a variation j in proportion to ωj (j changes from 1 to k).

The computation of ωj from formula 1 is rather difficult, but by simulation it’s easy. For each variation, θ can be simulated as independent random variable from beta-distribution Beta(α, β), where


Here α(0), β(0) are priori values for α and β (set equal to 1, as a rule).

For k variations, one should simulate a large table containing samples of θ from the relevant beta distributions, where the columns represent k variations. A Monte Carlo estimate of the probability for a variation to be optimal is the empirical fraction of rows for which considered variation had the largest simulated value.

Let’s consider an example. Suppose in the test for variations A, B, C with the number of visitors 20, 30, 40 we received 12, 20, 30 conversions respectively. Then to simulate θ for variations A, B and C we use distributions Beta(13, 9), Beta(21, 11) and Beta(31, 11) respectively. Below there is a fragment of the table, consisting of samples for each variation.

Bayesian Multi-Armed Bandit
Fragment of samples from θ for variations A, B, and C

Since the maximum value in the first row of the table above (0.74) is in the column corresponding to the variant C, then variant C is assigned to the first row. Analogically, each row is put into correspondence with the variation.

To calculate the probability of being optimal for variation A, one needs to divide the number of rows, variation A was assigned to, by the total number of rows. Probabilities to be optimal for variations B and C are calculated similarly.

Thus, we start the first day of the test with a number of visitors per variation in a proportion to our prior beliefs about variations probabilities to be optimal.

Then periodically, say once per day, we recalculate the probability of each variation to be optimal. That gives us proportions of visitors for the next day. We repeat this process until a set of stopping rules is satisfied.

Google Analytics, for example, by default runs a Bayesian Multi-Armed Bandit test for at least two weeks. Then they keep track of variations probability to be optimal and potential value remaining in the test. They stop the test when they have a variation with probability to be optimal 95% (a champion) and a potential value remaining in the test is less than 1% of the champion’s conversion rate.

Potential value remaining in the test is calculated as the 95th percentile of the distribution of (θmax – θ*) / θ*, where θmax is the largest value of θ in a row, and θ* is the value of θ for the variation that is most likely to be optimal.

Suppose calculations of probabilities to be optimal for variants A, B, and C based on the example presented in table 1, gave the values 0.09, 0.2 and 0.71, respectively. The probability of being optimal for C 0.71 means the value of switching away from variant C is zero in 71% of the cases. Let’s calculate the remaining potential value. The values max-θ*) / θ* are calculated in the fourth column in the table below.

Multi-Armed Bandit test
Example of calculation (θmax-θ*) / θ*

The histogram for the values from the fourth column of table 2 is shown in the left part of the figure below. The 95th percentile of the value distribution is the “potential value remaining” in the test, which in this case works out to be about 0.16. We interpret this number as “We’re still unsure about conversion rate for variation C, but whatever it is, variation A or B might beat it by as much as 16%.”

histogram for the values
The distribution of the value remaining in a test.
The vertical line in each case is the 95th percentile, or the potential value remaining

Suppose we continued the test for some time and the total number of visitors for each variation increased 5 times, i.e. reached 100, 150 and 200 for A, B and C. At the same time, the total number of conversions increased to 60, 100 and 150, respectively.

After calculating the new α and β parameters for the three beta distributions, constructing the samples and computing the optimal variant probabilities, we obtained that the optimal probability for variation C was 0.95. This means that the value of switching to variant A or B is zero in 95% of the cases, and, therefore, the 95th percentile of the value-remaining distribution is zero (right side of the figure above).

When to Use Multi-Armed Bandit Instead of A/B/n Tests?

It’s worth checking a great post discussing this issue: “When to Run Bandit Tests Instead of A/B/n Tests”. Generally speaking, specialists recommend using Multi-Armed Bandit when you care about optimization, rather than understanding. Specifically, MAB algorithms tend to work well for short tests (when you have a small amount of time) – and paradoxically –long (ongoing) tests.

Short Tests

For instance, headlines are the perfect use case for Multi-Armed Bandit tests as news has a short lifetime and MAB tests can determine quickly which headline is better.

preparing app store page for holidays

The same is with short-term campaigns. If a campaign is one week long (for example, if you prepare your app store page for Holidays), you don’t want to spend the week for pure exploration with visitors evenly distributed between variations, because once you learn anything, it’s too late to exploit the best variation.

Long Tests

Multi-Armed Bandit algorithms can be easily automated, that is why they are effective for constantly evolving tests. For example, if we need to determine the best order to display the top 5 news articles on our site regularly, it’s convenient to automate this task with a MAB test.

Another long-term use of Multi-Armed Bandit algorithms is targeting. Some types of users may be more common than others. Contextual MAB can take advantage of this, by applying the learned targeting rules sooner for more common users, while continuing to learn (test) on the rules for the less common user types.

When it comes to mobile A/B testing, Multi-Armed Bandit may become a real saver of marketing budget and time. If you launch your experiments with SplitMetrics A/B testing platform, it’s possible to activate MAB, leave all concerns regarding formidable sample size behind and enjoy statistically significant results without extra expenses.