One of the most popular questions app publishers ask is **how much traffic** they need to run valid A/B tests. Unfortunately, there is **no magic number** that fits every single experiment. The optimal traffic volume for mobile A/B testing is **individual** and depends on such **factors** as the traffic source, the app’s conversion rate, and targeting.

Now let’s get to the main point: how do you determine the sample size for A/B tests? It’s really important to understand this fully, as sample size has a considerable effect on **checking the significance** of the observed difference in the variations’ performance.

In this post, we’ll also review one of the A/B test **sample size measuring methods** which is widely used and helps to make a statistically **valid decision** based on the results of your mobile A/B testing.

## Sample Size Influence on Checking Results Trustworthiness within Mobile A/B Testing

Suppose MSQRD decided to check whether a changed **order of screenshots** (Variation B) yields a better conversion rate. Presume that we got the **following results** after sending 200 different users to each variation:

- Variation A – 40 converted users;
- Variation B – 57 converted users.

A significance check shows that the observed difference in the variations’ performance is **statistically significant** at the confidence level of **95%**. The picture above shows the result of this check performed using an online mobile A/B testing calculator.

Let’s imagine we didn’t finish the experiment on reaching the above-mentioned result and **continued driving traffic**. When each variation got 500 users, we got the following results:

- Variation A – 101 converted users;
- Variation B – 127 converted users.

In this case, the significance check shows that the observed performance difference **isn’t statistically significant** at the 95% confidence level.
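The two checks above can be sketched in a few lines of Python. This is only an illustration under an assumption: the post does not say which test the online calculator runs, so the sketch uses a two-tailed two-proportion z-test with a pooled variance. The function name `two_proportion_p_value` is made up for this example.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-tailed p-value of a pooled two-proportion z-test (assumed test type)."""
    cr_a, cr_b = conv_a / n_a, conv_b / n_b
    # Pooled conversion rate under the null hypothesis CR(A) = CR(B)
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (cr_b - cr_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


p_200 = two_proportion_p_value(40, 200, 57, 200)
p_500 = two_proportion_p_value(101, 500, 127, 500)

# A result is significant at the 95% confidence level when p < 0.05
print(f"200 users per variation: p = {p_200:.4f}, significant: {p_200 < 0.05}")
print(f"500 users per variation: p = {p_500:.4f}, significant: {p_500 < 0.05}")
```

With 500 users per variation the p-value lands just above 0.05, so a different test type (for example a one-tailed test) could flip the verdict. That is one more reason to fix the sample size in advance.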

Is the example we examined realistic? Sure, it is.

For instance, when the true conversion rates of variations A and B are **20% and 26% respectively**, these values fall within the corresponding **confidence intervals** for cases with both 200 and 500 visitors per variation.
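This claim can be checked with a minimal sketch. It assumes the simple normal-approximation (Wald) interval at 95% confidence; the post does not state which interval type it has in mind, and `wald_interval` is a name introduced only for this example.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(conversions, visitors, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a conversion rate."""
    rate = conversions / visitors
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(rate * (1 - rate) / visitors)
    return rate - margin, rate + margin


# (converted users, visitors, assumed true conversion rate)
observations = {
    "A, 200 users": (40, 200, 0.20),
    "B, 200 users": (57, 200, 0.26),
    "A, 500 users": (101, 500, 0.20),
    "B, 500 users": (127, 500, 0.26),
}

covered = {}
for label, (conv, n, true_cr) in observations.items():
    low, high = wald_interval(conv, n)
    covered[label] = low <= true_cr <= high
    print(f"{label}: [{low:.3f}, {high:.3f}] contains {true_cr:.2f}: {covered[label]}")
```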

According to this example, if we finished the experiment on reaching **200 visitors for each variation**, we could conclude that variation B performed better. However, if we finished the test after having 500 visitors on each product page variant, we could conclude that **both variations are interchangeable**. Pretty confusing, isn’t it?

It raises the legitimate question:

How many users do we need to run trustworthy mobile A/B tests?

Thus, we need to figure out what sample size is necessary for getting statistically significant results in the course of our mobile A/B testing.

## How to Calculate A/B Testing Sample Size

Now, let’s review how to calculate a sample size for A/B tests based on statistical hypothesis testing.

First, we need to understand what the **null hypothesis** really is. In mobile A/B testing, the null hypothesis is normally the assumption that the difference between the performances of variations A and B **equals zero**.

It has been theoretically proven that the sample size required for **acceptance/rejection** of the null hypothesis for a KPI expressed as a proportion (conversion rate in our case) depends on **the following 5 parameters**:

- the conversion rate value of our control variation (variation A);
- the minimum difference between the values of variations A and B conversion rates which is to be identified;
- chosen confidence/significance level;
- chosen statistical power;
- type of the test: one- or two-tailed test.

### Determining Sample Size for our MSQRD Mobile A/B Testing

Let’s clarify the above-mentioned parameters and **determine the sample size** for our MSQRD example:

- The conversion rate of variation A: 20% *(CR(A) = 0.2)*;
- The conversion rate of variation B: 26% *(CR(B) = 0.26)*.

Thus, the conversion rate value of our control variation A is 20% *(CR(A) = 0.2)*. Our example presumes that:

- the **minimum difference** between the conversion values of variations A and B is 6% in absolute terms;
- variation B performed **better** than variation A (*CR(B) = 0.26*).

When determining the sample size, some calculators for A/B testing ask for the **minimum conversion rate difference** in relative terms instead of absolute. In our example, the minimum difference of **6% in absolute terms** corresponds to the **relative difference of 30%** (20% * 0.3 = 6%).
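The conversion between the two ways of stating the difference is simple arithmetic. Here is a small sketch; the helper names `relative_difference` and `absolute_difference` are illustrative and not taken from any calculator.

```python
def relative_difference(cr_a, cr_b):
    """Relative lift of variation B over the control variation A."""
    return (cr_b - cr_a) / cr_a


def absolute_difference(cr_a, relative):
    """Absolute difference that corresponds to a relative lift over CR(A)."""
    return cr_a * relative


rel = relative_difference(0.20, 0.26)
abs_diff = absolute_difference(0.20, 0.30)

# Rounded for display because of floating-point representation
print(f"relative difference: {rel:.0%}")       # 6 points on top of 20% is a 30% lift
print(f"absolute difference: {abs_diff:.0%}")  # a 30% lift on 20% is 6 points
```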

As it was clarified in our post on mobile A/B testing results analysis, the sum of confidence level and significance level values should be 100%. Let’s choose the **confidence level of 95%** and the significance level of 5% for our MSQRD example as these are the values of the parameters which are **most commonly used** in A/B tests.

### Type I Error

Let’s denote the significance level by **α**; the confidence level is then **1 − α**. Mind that the significance level is the probability of erroneously **rejecting the null hypothesis** (type I error). In the context of mobile A/B testing, it is the probability of concluding that the conversion rates of variations A and B differ when in fact they are equal.

### Type II Error

There is another error type which should be minimized as well. This error consists in **failing to reject the null hypothesis when it’s false** (type II error). When it comes to mobile A/B testing, it means concluding that the conversion rates of variations A and B are equal when they actually differ.

In statistics, the probability of type II error is usually denoted by **β**, and the value equal to **1 − β** is called statistical power. The **value of statistical power** in A/B tests is typically set to 80%.

The choice of a one- or two-tailed test depends on **what we want to check**:

- A **one-tailed test** is used if we want to check the significance of the observed positive difference in variations conversion rates (i.e. our goal is to replace variation A with variation B if the latter has a better conversion rate).
- A **two-tailed test** is used if we want to check whether *CR(B)* and *CR(A)* differ (i.e. we are interested in both positive and negative differences).

Here’s an A/B test sample size formula for **one-tailed tests**:

In case of a **two-tailed test**, we use the following A/B test sample size formula:

*n1* – the number of visitors for each variation A and B in case of a one-tailed test;

*n2* – the number of visitors for each variation A and B in case of a two-tailed test;

*Z* – *standard score* or *Z-score*.

The only difference between these two A/B testing sample size formulas is that **Z(α)** is used in the first one while the second uses **Z(α/2)**.
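Written out, the two formulas can take the following form. This is a sketch under an assumption: it uses the normal approximation with a pooled variance under the null hypothesis, a choice that reproduces the sample sizes quoted in this post. The symbols $p_A$, $p_B$ and $\bar{p}$ are introduced here for illustration and stand for *CR(A)*, *CR(B)* and their average.

$$
n_1 = \frac{\left( Z(\alpha)\sqrt{2\bar{p}(1-\bar{p})} + Z(1-\beta)\sqrt{p_A(1-p_A) + p_B(1-p_B)} \right)^2}{(p_B - p_A)^2}
$$

$$
n_2 = \frac{\left( Z(\alpha/2)\sqrt{2\bar{p}(1-\bar{p})} + Z(1-\beta)\sqrt{p_A(1-p_A) + p_B(1-p_B)} \right)^2}{(p_B - p_A)^2},
\qquad \bar{p} = \frac{p_A + p_B}{2}
$$

Here $Z(\alpha)$ and $Z(\alpha/2)$ are the one- and two-tailed critical values for the chosen significance level (about 1.645 and 1.960 at 5%), and $Z(1-\beta)$ is the standard score for the chosen statistical power (about 0.842 at 80%). The result is rounded up to the next whole visitor.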

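The values of Z(α), Z(α/2) and Z(1-β) can be calculated with the help of the **Excel function NORM.S.INV**. For a 5% significance level and 80% statistical power, NORM.S.INV(0.95) ≈ 1.645 (one-tailed test), NORM.S.INV(0.975) ≈ 1.960 (two-tailed test) and NORM.S.INV(0.8) ≈ 0.842.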

After making the calculations for the MSQRD example we’ve mentioned above and **rounding the results**, we’ll get the following values: *n1* = 608 and *n2* = 772.

These calculations can also be made with the help of the free-to-use software G*Power.

Therefore, if *CR(A)* is 20% and the estimated *CR(B)* value is at least 26%, we’ll have to run our experiment until **each variation gets 608 different visitors** to check the statistical significance at the significance level of 5% and with 80% statistical power. Thus, the **total number of the experiment visitors** should be 1216.

In case we are interested in both positive and negative conversion rate differences, the results will be slightly different.

If *CR(A)* is 20% and we need to find a 6% difference in absolute terms, we’ll have to fill each variation with 772 different visitors to check the statistical significance at the significance level of 5% and with 80% statistical power. Therefore, **the total number of visitors within this mobile A/B testing** should be 1544.
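Both sample sizes can be reproduced with a short script. This is a minimal sketch, assuming the pooled-variance normal approximation, which matches the 608 and 772 figures above; the function name `sample_size_per_variation` and its parameters are made up for this example. `NormalDist().inv_cdf` from the Python standard library plays the role of Excel’s NORM.S.INV.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variation(cr_a, cr_b, alpha=0.05, power=0.80, two_tailed=False):
    """Visitors needed per variation (assumed pooled-variance normal approximation)."""
    inv = NormalDist().inv_cdf  # same role as Excel's NORM.S.INV
    z_alpha = inv(1 - alpha / 2) if two_tailed else inv(1 - alpha)
    z_power = inv(power)
    p_bar = (cr_a + cr_b) / 2  # average conversion rate under the null hypothesis
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(cr_a * (1 - cr_a) + cr_b * (1 - cr_b))
    ) ** 2
    # Round up: a fraction of a visitor still requires one more visitor
    return ceil(numerator / (cr_b - cr_a) ** 2)


one_tailed = sample_size_per_variation(0.20, 0.26)
two_tailed = sample_size_per_variation(0.20, 0.26, two_tailed=True)

print(one_tailed, 2 * one_tailed)  # 608 per variation, 1216 in total
print(two_tailed, 2 * two_tailed)  # 772 per variation, 1544 in total
```

The two-tailed test needs more visitors because the 5% error probability is split between both directions, which raises the critical value from about 1.645 to about 1.960.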

Thus, if *n* is the sample size calculated according to the method described above, to compare variation B with control variation A, we’ll need ***n* visitors** for variation A and ***n* visitors** for variation B (the total is **2 × *n* visitors**).

But what should we do if the experiment we run has 3 variations?

Let’s imagine we add the **third variation C** to our MSQRD experiment. This variation C should be compared with variation A (the control one). Therefore, we’ll have to fill variation C with *n* visitors as well and make the total number of experiment visitors equal to 3 × *n*.

It’s really important to remember that the **smaller the difference** between the conversion rate values of variations A and B which is to be identified, the **greater the sample size** required for the test.

That’s why if the conversion rates difference at the end of our experiment is less than the **expected minimum** (the one we used in our sample size calculations), we’ll need to **recalculate** the sample size for A/B testing taking into consideration the new assumed minimum value and continue the test.

Let’s go back to our MSQRD example and see how it works.

While calculating the original sample size for our **one-tailed test**, we had:

- *CR(A)* value = 20%
- *CR(B)* minimal value = 26%

So we planned to finish the experiment when each variation is visited by 608 users.

Suppose that we filled both variations with the necessary number of users (608 each, 1216 in total) and got the following **results**:

- *CR(A)* = 20%
- *CR(B)* = 25%

We see that 25% is less than 26%. So it’s necessary to **recalculate our sample size** for the absolute **minimal difference of 5%**. The updated sample size will be 862. Thus, we need to run our test until we draw **254 more visitors** to each variation.
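The recalculation, and the way the required sample grows as the minimum difference shrinks, can be sketched as follows. As before, this assumes a one-tailed test with the pooled-variance normal approximation at a 5% significance level and 80% power; `required_visitors` is a hypothetical helper name.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_visitors(cr_a, cr_b, alpha=0.05, power=0.80):
    """One-tailed sample size per variation (assumed pooled-variance approximation)."""
    inv = NormalDist().inv_cdf
    p_bar = (cr_a + cr_b) / 2
    numerator = (
        inv(1 - alpha) * sqrt(2 * p_bar * (1 - p_bar))
        + inv(power) * sqrt(cr_a * (1 - cr_a) + cr_b * (1 - cr_b))
    ) ** 2
    return ceil(numerator / (cr_b - cr_a) ** 2)


planned = required_visitors(0.20, 0.26)  # original plan: 6% minimum difference
updated = required_visitors(0.20, 0.25)  # recalculated: 5% minimum difference
print(f"planned: {planned}, updated: {updated}, extra per variation: {updated - planned}")

# The smaller the difference to be detected, the larger the required sample
for points in (6, 5, 4, 3, 2):
    n = required_visitors(0.20, 0.20 + points / 100)
    print(f"minimum difference {points}%: {n} visitors per variation")
```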

If we get *CR(A) = 20%* and *CR(B) = 25%* or more at the end of the test with 862*2 visitors, it’ll mean that variation B is **significantly better** than variation A.

## Afterthought on Mobile A/B Testing Sample Size

In this post, we examined the approach which presupposes determining a sample size **before we launch** our experiments and analyzing the results only after our mobile A/B testing is finished. This method is based on **statistical hypothesis testing** and keeps the probabilities of type I and type II errors at the chosen levels.

This approach has a **drawback**: with a relatively small conversion rate value of the control variation and a small expected difference between variations A and B, the **required sample size is big**. There are cases when sample size numbers are really hard to achieve in the course of mobile A/B testing.

However, this problem can be solved with the help of multi-armed bandit experiments that are based on applying **Bayesian Statistics**. Running your tests with SplitMetrics, you’re able to activate a multi-armed bandit, enjoy statistically significant experiments **without extra traffic expenses** and achieve app store success.


Points for bringing up G-Power – it is indeed a great tool covering a ton of different cases, albeit it might be a bit overwhelming for a newcomer to statistics. However, a major issue with it and the Evan Miller tool is that they do not support comparisons between more than two groups, which is often the case in A/B testing where you would have group A – control, and treatment groups B, C, D. With these calculators you will not be able to calculate a proper sample size. Multiplying the sample size for two groups by 2 will actually result in an overpowered test.

Another issue is that as far as I’ve seen both tools do not support sample size calculations for relative difference (percent change, % lift), which is what is often sought as a final measure for the success or failure of an A/B test. Instead, they calculate sample sizes for absolute difference. This free calculator handles both of these cases:

https://www.gigacalculator.com/calculators/power-sample-size-calculator.php . It is also more friendly towards people new to stats than G-Power.

Best,

Georgi