# Determining Sample Size for A/B Testing: Method Based on Statistical Hypothesis Testing

One of the most popular questions app publishers ask is how much traffic they need to run valid A/B tests. Unfortunately, there is no magic number that fits every experiment. The optimal traffic volume is individual and depends on factors such as the traffic source, the app’s conversion rate, and targeting.

However, we can get to the bottom of sample size calculations for A/B tests. It’s important to understand them fully, as sample size has a considerable effect on checking the significance of the observed difference in the variations’ performance.

In this post, we’ll also review one widely used method of calculating sample size that helps you make a statistically valid decision based on the results of your A/B test.

## Sample Size Influence on Checking Results Trustworthiness

Suppose MSQRD decided to check whether a changed order of screenshots (Variation B) yields a better conversion rate. Assume that we got the following results after sending 200 different users to each variation:

• Variation A – 40 converted users;
• Variation B – 57 converted users.

Checking these numbers with an online A/B testing calculator shows that the observed difference in the variations’ performance is statistically significant at the 95% confidence level.

Let’s imagine we didn’t stop the experiment at that point and continued driving traffic. When each variation had received 500 users, we got the following results:

• Variation A – 101 converted users;
• Variation B – 127 converted users.

In this case, the significance check shows that the observed performance difference is not statistically significant at the 95% confidence level.
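Both checks can be reproduced with a few lines of Python. This is only a sketch: it assumes the calculator runs a two-sided, pooled two-proportion z-test, which may differ from what a particular online tool does, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test (assumed test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled conversion rate under the null hypothesis CR(A) = CR(B).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


p_200 = two_proportion_p_value(40, 200, 57, 200)
p_500 = two_proportion_p_value(101, 500, 127, 500)
print(f"200 users per variation: p = {p_200:.4f}")
print(f"500 users per variation: p = {p_500:.4f}")
```

Under these assumptions the first p-value falls below 0.05, while the second lands marginally above it, which is why the verdict flips between the two checkpoints.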

Is the example we examined realistic? Sure, it is.

For instance, when the true conversion rates of variations A and B are 20% and 26% respectively, these values fall within the corresponding confidence intervals for both 200 and 500 visitors per variation.
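A quick way to see this is to compute the 95% confidence interval of each observed conversion rate. The sketch below assumes the simple normal-approximation (Wald) interval; other interval types would give slightly different bounds, and the helper name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(conversions, visitors, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a conversion rate."""
    rate = conversions / visitors
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(rate * (1 - rate) / visitors)
    return rate - margin, rate + margin


# (label, converted users, visitors, assumed true conversion rate)
cases = [
    ("A, 200 users", 40, 200, 0.20),
    ("B, 200 users", 57, 200, 0.26),
    ("A, 500 users", 101, 500, 0.20),
    ("B, 500 users", 127, 500, 0.26),
]

covered = []
for label, conversions, visitors, true_rate in cases:
    low, high = wald_interval(conversions, visitors)
    covered.append(low <= true_rate <= high)
    print(f"{label}: [{low:.3f}, {high:.3f}] contains {true_rate:.0%}: {covered[-1]}")
```

All four intervals contain the assumed true rate, so both sets of observations are plausible outcomes of the same underlying conversion rates.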

According to this example, if we had stopped the experiment upon reaching 200 visitors for each variation, we would have concluded that variation B performed better. However, if we had stopped the test after 500 visitors on each product page variant, we could have concluded that there is no significant difference between the variations. Pretty confusing, isn’t it?

It raises the legitimate question:

How many users do we need to run trustworthy A/B tests?

Thus, we need to figure out what sample size is necessary for getting statistically significant results.

## How to Calculate Sample Size of A/B Test

Now, let’s review a sample size measuring method which is based on statistical hypothesis testing.

First, we need to understand what the null hypothesis is. In A/B testing, the null hypothesis is normally the assumption that the difference between the performance of variations A and B equals zero.

It can be shown that the sample size required to accept or reject the null hypothesis for a KPI expressed as a proportion (the conversion rate in our case) depends on the following five parameters:

1. the conversion rate value of our control variation (variation A);
2. the minimum difference between the values of variations A and B conversion rates which is to be identified;
3. chosen confidence/significance level;
4. chosen statistical power;
5. type of the test: one- or two-tailed.

### Determining Sample Size for our MSQRD A/B Test

Let’s clarify the above-mentioned parameters and determine the sample size for our MSQRD example:

• The conversion rate of variation A: 20% (CR(A) = 0.2);
• The conversion rate of variation B: 26% (CR(B) = 0.26).

Thus, the conversion rate value of our control variation A is 20% (CR(A) = 0.2). Our example presumes that:

• the minimum difference between the conversion values of variations A and B is 6%  in absolute terms;
• variation B performed better than variation A (CR(B) = 0.26).

When determining sample size, some A/B testing calculators ask for the minimum conversion rate difference in relative rather than absolute terms. In our example, the minimum difference of 6% in absolute terms corresponds to a relative difference of 30% (20% * 0.3 = 6%).
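The conversion between the two forms is simple arithmetic, sketched here in Python. The function names are illustrative and are not taken from any particular calculator.

```python
def relative_difference(cr_control, cr_variation):
    """Relative lift of the variation over the control conversion rate."""
    return (cr_variation - cr_control) / cr_control


def absolute_difference(cr_control, relative_lift):
    """Absolute difference that corresponds to a given relative lift."""
    return cr_control * relative_lift


rel = relative_difference(0.20, 0.26)
abs_diff = absolute_difference(0.20, 0.30)
print(f"relative difference: {rel:.0%}")
print(f"absolute difference: {abs_diff:.0%}")
```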

As explained in our post on A/B test results analysis, the confidence level and the significance level should add up to 100%. Let’s choose a confidence level of 95% and a significance level of 5% for our MSQRD example, as these are the parameter values most commonly used in A/B tests.

### Type I Error

Let’s denote the significance level by α; the confidence level is then 1 - α. Mind that the significance level is the probability of erroneously rejecting the null hypothesis (type I error). In the context of A/B testing, it is the probability of concluding that the conversion rates of variations A and B differ when in fact they are equal.

### Type II Error

There is another type of error that should be minimized as well: failing to reject the null hypothesis when it is false (type II error). In A/B testing, it means concluding that the conversion rates of variations A and B are equal when they actually differ.

In statistics, the probability of a type II error is usually denoted by β, and the value 1 - β is called statistical power. In A/B tests, statistical power is usually set to 80%.

The choice of a one- or two-tailed test depends on what we want to check:

• A one-tailed test is used if we want to check the significance of an observed positive difference in the variations’ conversion rates (i.e. our goal is to replace variation A with variation B if the latter has a better conversion rate).
• A two-tailed test is used if we want to check whether CR(B) and CR(A) differ (i.e. we are interested in both positive and negative differences).

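To calculate the sample size for a one-tailed test, the following formula is used:

$$n_1 = \frac{\left(Z(\alpha)\sqrt{2\,\bar{p}\,(1-\bar{p})} + Z(1-\beta)\sqrt{\mathrm{CR}(A)\,(1-\mathrm{CR}(A)) + \mathrm{CR}(B)\,(1-\mathrm{CR}(B))}\right)^2}{\left(\mathrm{CR}(B)-\mathrm{CR}(A)\right)^2}$$

In the case of a two-tailed test, we use the following formula:

$$n_2 = \frac{\left(Z(\alpha/2)\sqrt{2\,\bar{p}\,(1-\bar{p})} + Z(1-\beta)\sqrt{\mathrm{CR}(A)\,(1-\mathrm{CR}(A)) + \mathrm{CR}(B)\,(1-\mathrm{CR}(B))}\right)^2}{\left(\mathrm{CR}(B)-\mathrm{CR}(A)\right)^2}$$

n1 – the number of visitors for each of variations A and B in the case of a one-tailed test;

n2 – the number of visitors for each of variations A and B in the case of a two-tailed test;

p̄ – the average of the two conversion rates, (CR(A) + CR(B)) / 2.

The only difference between these two formulas is that Z(α) is used in the first one while the second uses Z(α/2).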

The values of Z(α), Z(α/2) and Z(1-β) can be calculated with the help of the Excel function NORM.S.INV.
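For readers who prefer code to a spreadsheet, the same values can be obtained in Python. NORM.S.INV returns the inverse of the standard normal cumulative distribution, and `statistics.NormalDist().inv_cdf` from the standard library does the same. The sketch assumes that the critical values are upper-tail quantiles (taken at 1 minus the significance level) and uses the 5% significance level and 80% power of the MSQRD example; the variable names are illustrative.

```python
from statistics import NormalDist

significance = 0.05  # probability of a type I error
power = 0.80         # 1 - probability of a type II error

std_normal = NormalDist()  # mean 0, standard deviation 1

# Assumed convention: critical values are upper-tail quantiles.
z_one_tailed = std_normal.inv_cdf(1 - significance)      # Z(alpha)
z_two_tailed = std_normal.inv_cdf(1 - significance / 2)  # Z(alpha/2)
z_power = std_normal.inv_cdf(power)                      # Z(1 - beta)

print(f"Z(alpha)    = {z_one_tailed:.4f}")
print(f"Z(alpha/2)  = {z_two_tailed:.4f}")
print(f"Z(1 - beta) = {z_power:.4f}")
```

The three values come out at roughly 1.645, 1.960 and 0.842.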

Making the calculations for the MSQRD example mentioned above and rounding the results up, we get n1 = 608 and n2 = 772.

These calculations can also be made with the help of the free-to-use software G*Power.

Therefore, if CR(A) is 20% and the expected CR(B) value is at least 26%, a one-tailed test requires us to run the experiment until each variation gets 608 different visitors in order to check statistical significance at the 5% significance level with 80% statistical power. Thus, the total number of experiment visitors should be 1216.

If we are interested in both positive and negative conversion rate differences (a two-tailed test), the required sample is larger.

If CR(A) is 20% and we need to detect a 6% difference in absolute terms, we’ll have to send 772 different visitors to each variation to check statistical significance at the 5% significance level with 80% statistical power. Therefore, the total number of A/B test visitors should be 1544.
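The calculation can be reproduced with a short Python sketch. It assumes the standard pooled-variance sample size formula for comparing two proportions, which yields the 608 and 772 figures quoted above; the function name and its parameters are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variation(cr_a, cr_b, significance=0.05, power=0.80,
                              two_tailed=False):
    """Visitors needed in EACH variation (assumed pooled-variance formula)."""
    tail = significance / 2 if two_tailed else significance
    z_significance = NormalDist().inv_cdf(1 - tail)
    z_power = NormalDist().inv_cdf(power)
    mean_rate = (cr_a + cr_b) / 2
    numerator = (
        z_significance * sqrt(2 * mean_rate * (1 - mean_rate))
        + z_power * sqrt(cr_a * (1 - cr_a) + cr_b * (1 - cr_b))
    ) ** 2
    # Round up: a fraction of a visitor still requires a whole visitor.
    return ceil(numerator / (cr_b - cr_a) ** 2)


n_one_tailed = sample_size_per_variation(0.20, 0.26)
n_two_tailed = sample_size_per_variation(0.20, 0.26, two_tailed=True)
print(f"one-tailed: {n_one_tailed} per variation, {2 * n_one_tailed} in total")
print(f"two-tailed: {n_two_tailed} per variation, {2 * n_two_tailed} in total")
```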

Thus, if n is the sample size calculated according to the method described above, to compare variation B with control variation A, we’ll need n visitors for variation A and n visitors for variation B (the total is 2*n visitors).

But what should we do if the experiment we run has 3 variations?

Let’s imagine we add the third variation C to our MSQRD experiment. This variation C should be compared with variation A (the control one). Therefore, we’ll have to fill variation C with n visitors as well and make the total number of experiment visitors equal to 3*n.

It’s important to remember that the smaller the difference between the conversion rates of variations A and B that we want to detect, the larger the sample size required for the test.
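The effect is easy to see by computing the required sample for several minimum differences. The sketch below assumes a one-tailed test, a 20% control conversion rate, a 5% significance level, 80% power, and the standard pooled-variance formula for two proportions; the function name and the list of differences are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_one_tailed(cr_a, cr_b, significance=0.05, power=0.80):
    """Visitors per variation for a one-tailed test (assumed formula)."""
    z_significance = NormalDist().inv_cdf(1 - significance)
    z_power = NormalDist().inv_cdf(power)
    mean_rate = (cr_a + cr_b) / 2
    numerator = (
        z_significance * sqrt(2 * mean_rate * (1 - mean_rate))
        + z_power * sqrt(cr_a * (1 - cr_a) + cr_b * (1 - cr_b))
    ) ** 2
    return ceil(numerator / (cr_b - cr_a) ** 2)


baseline = 0.20
differences = [0.10, 0.06, 0.05, 0.03, 0.02, 0.01]  # absolute terms
sizes = [sample_size_one_tailed(baseline, baseline + d) for d in differences]

for d, n in zip(differences, sizes):
    print(f"minimum difference {d:.0%}: {n} visitors per variation")
```

The required sample grows roughly with the inverse square of the difference, so halving the minimum difference about quadruples the traffic needed.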

That’s why if the conversion rates difference at the end of our A/B test is less than the expected minimum (the one we used in our sample size calculations), we’ll need to recalculate the sample size taking into consideration the new assumed minimum value and continue the test.

Let’s go back to our MSQRD example and see how it works.

While calculating the original sample size for our one-tailed test, we had:

• CR(A) value = 20%
• CR(B)  minimal value = 26%

So we planned to finish the experiment when each variation is visited by 608 users.

Suppose that we sent the necessary number of users to each variation (608 each, 1216 in total) and got the following results:

• CR(A) = 20%
• CR(B) = 25%

We see that 25% is less than 26%, so it’s necessary to recalculate the sample size for an absolute minimum difference of 5%. The updated sample size is 862 visitors per variation. Thus, we need to run the test until each variation gets 254 more visitors.
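The recalculation can be sketched in Python as follows. As before, this assumes a one-tailed test, a 5% significance level, 80% power, and the standard pooled-variance formula for two proportions; the names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_one_tailed(cr_a, cr_b, significance=0.05, power=0.80):
    """Visitors per variation for a one-tailed test (assumed formula)."""
    z_significance = NormalDist().inv_cdf(1 - significance)
    z_power = NormalDist().inv_cdf(power)
    mean_rate = (cr_a + cr_b) / 2
    numerator = (
        z_significance * sqrt(2 * mean_rate * (1 - mean_rate))
        + z_power * sqrt(cr_a * (1 - cr_a) + cr_b * (1 - cr_b))
    ) ** 2
    return ceil(numerator / (cr_b - cr_a) ** 2)


planned = sample_size_one_tailed(0.20, 0.26)  # original minimum difference: 6%
updated = sample_size_one_tailed(0.20, 0.25)  # observed difference: 5%
extra = updated - planned

print(f"planned: {planned}, updated: {updated}, extra per variation: {extra}")
```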

If we get CR(A) = 20% and CR(B) = 25% or more at the end of the test with 862*2 visitors, it will mean that variation B is significantly better than variation A.

## Afterthought

In this post, we examined an approach in which the sample size is determined before the test is launched and the results are analyzed only after the experiment is finished. This method is based on statistical hypothesis testing and keeps the probabilities of type I and type II errors at the chosen levels.

This approach has a drawback: with a relatively small conversion rate value of the control variation and a small expected difference between variations A and B, the required sample size is big. There are cases when sample size numbers are really hard to achieve in the course of an A/B test.

However, this problem can be solved with the help of multi-armed bandit experiments, which are based on Bayesian statistics. Running your tests with SplitMetrics, you can activate a multi-armed bandit and enjoy statistically significant experiments without extra traffic expenses.