A/B Testing Sample Size: A Method Based on Statistical Hypothesis Testing

One of the most popular questions app publishers ask is how much traffic they need to run valid A/B tests. Unfortunately, there is no magic number that fits every experiment. The optimal traffic volume for mobile A/B testing is individual and depends on such factors as the traffic source, the app's conversion rate, and targeting.
Now let's get to the main point: how do you determine the sample size for A/B tests? It's important to understand this fully, as sample size has a considerable effect on whether the observed difference in variation performance can be judged statistically significant.
In this post, we'll also review a widely used method of calculating A/B test sample size that helps you make statistically valid decisions based on the results of your mobile A/B testing.
Suppose that MSQRD decided to check whether a changed order of screenshots (variation B) yields a better conversion rate. Presume that we got the following results after driving 200 unique users to each variation:
Thus, the observed difference in variation performance is statistically significant at the 95% confidence level. The picture above shows the result of this validation performed with an online mobile A/B testing calculator.
Now let's imagine we didn't finish the experiment upon reaching the above result and continued driving traffic. When each variation got 500 users, we got the following results:
In this case, the significance check shows that the observed performance difference isn't statistically significant at the 95% confidence level.
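To make the mechanics of such a check concrete, here is a minimal sketch of a two-proportion z-test of the kind these calculators run. The conversion counts below are hypothetical stand-ins (the actual figures are in the screenshots above); they only illustrate how a difference can look significant after 200 users per variation and yet fail to reach significance after 500.

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-tailed z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                 # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # standard error of the difference
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))    # two-tailed p-value
    return z, p_value

# Hypothetical counts, chosen only to illustrate the effect described above:
print(two_proportion_z_test(38, 200, 58, 200))    # p ≈ 0.019 -> significant at the 95% level
print(two_proportion_z_test(100, 500, 118, 500))  # p ≈ 0.17  -> not significant
```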
Is the example we examined realistic? Sure, it is.
For instance, when the true conversion rates of variations A and B are 20% and 26% respectively, these values fall within the corresponding confidence intervals for both cases, with 200 and with 500 visitors per variation, as the sketch below illustrates.
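Here is a rough sketch of that check, using a normal-approximation 95% confidence interval around an observed conversion rate. The observed counts are again hypothetical (the same ones used in the earlier sketch), chosen so that the interval for variation B contains the true 26% at both sample sizes.

```python
from math import sqrt

def conversion_ci(conversions, visitors, z=1.96):
    """Approximate 95% confidence interval for an observed conversion rate."""
    p = conversions / visitors
    margin = z * sqrt(p * (1 - p) / visitors)   # normal-approximation margin of error
    return p - margin, p + margin

# Hypothetical observed counts for variation B; both intervals contain the true 26%.
print(conversion_ci(58, 200))   # ≈ (0.227, 0.353)
print(conversion_ci(118, 500))  # ≈ (0.199, 0.273)
```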
According to this example, if we had finished the experiment upon reaching 200 visitors per variation, we would have concluded that variation B performed better. However, if we had finished the test after 500 visitors on each product page variant, we would have concluded that both variations are interchangeable. Pretty confusing, isn't it?
It raises the legitimate question:
How many users do we need to run trustworthy mobile A/B tests?
Thus, we need to figure out what sample size is necessary for getting statistically significant results in the course of our mobile A/B testing.
Now, let’s review how to calculate a sample size for A/B tests based on statistical hypothesis testing.
First, we need to understand what the null hypothesis really is. In mobile A/B testing, the null hypothesis is normally the assumption that the difference between the performance of variations A and B equals zero.
It has been theoretically proven that the sample size required to accept or reject the null hypothesis for a KPI expressed as a proportion (conversion rate in our case) depends on the following 5 parameters:
Let’s clarify the above-mentioned parameters and determine the sample size for our MSQRD example:
Thus, the conversion rate value of our control variation A is 20% (CR(A) = 0.2). Our example presumes that:
When determining the sample size, some A/B testing calculators ask for the minimum conversion rate difference in relative rather than absolute terms. In our example, the minimum difference of 6% in absolute terms corresponds to a relative difference of 30% (20% × 0.3 = 6%).
As clarified in our post on analyzing mobile A/B testing results, the confidence level and significance level should sum to 100%. Let's choose a confidence level of 95% and a significance level of 5% for our MSQRD example, as these are the values most commonly used in A/B tests.
Let's denote the significance level by α; the confidence level is then 1 − α. Mind that the significance level is the probability of erroneously rejecting the null hypothesis (a type I error). In the context of mobile A/B testing, it is the probability of concluding that the conversion rates of variations A and B differ when in fact they are equal.
There is another error type which should be minimized as well: failing to reject the null hypothesis when it is actually false (a type II error). In mobile A/B testing, it means concluding that the conversion rates of variations A and B are equal when they actually differ.
In statistics, the probability of a type II error is usually denoted by β, and the value 1 − β is called statistical power. In A/B tests, statistical power is conventionally set to 80%.
The choice between one-tailed and two-tailed testing depends on what we want to check:
Here’s an A/B test sample size formula for one-tailed tests:
In the case of two-tailed testing, we use the following A/B testing sample size formula:
n1 – the number of visitors for each variation A and B in case of a one-tailed test;
n2 – the number of visitors for each variation A and B in case of a two-tailed test;
Z – standard score or Z-score.
The only difference between these two A/B testing sample size formulas is that Z(α) is used in the first one while the second uses Z(α/2).
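Written out explicitly, here is a sketch of the pooled-proportion form of these formulas, which reproduces the 608 and 772 figures below (other calculators may use a slightly different variance term). The only symbol not introduced above is p̄ = (CR(A) + CR(B)) / 2, the pooled conversion rate.

```latex
% Sketch of the pooled-proportion sample size formulas;
% Z(.) denotes the upper critical value of the standard normal distribution.
\[
n_1 = \frac{\bigl(Z_{\alpha} + Z_{1-\beta}\bigr)^2 \cdot 2\,\bar{p}\,(1-\bar{p})}
           {\bigl(\mathrm{CR}(B) - \mathrm{CR}(A)\bigr)^2}
\qquad
n_2 = \frac{\bigl(Z_{\alpha/2} + Z_{1-\beta}\bigr)^2 \cdot 2\,\bar{p}\,(1-\bar{p})}
           {\bigl(\mathrm{CR}(B) - \mathrm{CR}(A)\bigr)^2}
\]
```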
The values of Z(α), Z(α/2) and Z(1−β) are upper critical values of the standard normal distribution and can be calculated with the Excel function NORM.S.INV:
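The same quantiles can also be cross-checked outside Excel; here is a minimal sketch using Python's standard library (values rounded to three decimals).

```python
from statistics import NormalDist

z = NormalDist().inv_cdf           # inverse of the standard normal CDF
alpha, beta = 0.05, 0.20           # 5% significance level, 80% power

print(round(z(1 - alpha), 3))      # Z(alpha)   for a one-tailed test: ~1.645
print(round(z(1 - alpha / 2), 3))  # Z(alpha/2) for a two-tailed test: ~1.960
print(round(z(1 - beta), 3))       # Z(1-beta): ~0.842
```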
Making these calculations for the MSQRD example mentioned above and rounding the results, we get the following values:
These calculations can also be made with the help of the free-to-use software G*Power.
Therefore, if CR(A) is 20% and the estimated CR(B) value is at least 26%, we'll have to run our experiment until each variation gets 608 unique visitors to detect this difference at the 5% significance level with 80% statistical power. Thus, the total number of visitors in the experiment should be 1216.
In case we are interested in both positive and negative conversion rate differences, the results will be slightly different.
If CR(A) is 20% and we need to detect a 6% difference in absolute terms in either direction, we'll have to drive 772 unique visitors to each variation to test for statistical significance at the 5% significance level with 80% statistical power. Therefore, the total number of visitors in this mobile A/B test should be 1544.
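For readers who prefer code to Excel or G*Power, here is a minimal sketch that reproduces both figures under the pooled-proportion formula sketched earlier; the function name and defaults are illustrative, not taken from any particular library.

```python
from statistics import NormalDist

def sample_size_per_variation(cr_a, cr_b, alpha=0.05, power=0.80, two_tailed=True):
    """Visitors needed per variation to detect the difference cr_b - cr_a."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_tailed else z(1 - alpha)
    z_power = z(power)
    p_bar = (cr_a + cr_b) / 2                          # pooled conversion rate
    n = (z_alpha + z_power) ** 2 * 2 * p_bar * (1 - p_bar) / (cr_b - cr_a) ** 2
    return round(n)

print(sample_size_per_variation(0.20, 0.26, two_tailed=False))  # 608
print(sample_size_per_variation(0.20, 0.26, two_tailed=True))   # 772
```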
Thus, if n is the sample size calculated according to the method described above, then to compare variation B with control variation A, we'll need n visitors for variation A and n visitors for variation B (2*n visitors in total).
But what should we do if the experiment we run has 3 variations?