One of the most popular questions app publishers ask is **how much traffic** they need to run valid A/B tests. Unfortunately, there is **no magic number** that fits every single experiment. The optimal traffic volume for mobile A/B testing is **individual** and depends on such **factors** as the traffic source, the app’s conversion rate, and targeting.

Now let’s get to the main point: how do you determine the sample size for A/B tests? It’s important to understand this fully, because sample size has a considerable effect on **checking the significance** of the observed difference in the variations’ performance.

In this post, we’ll review one of the widely used **methods for calculating A/B test sample size**, which helps you make a statistically **valid decision** based on the results of your mobile A/B testing.

## Sample Size Influence on Checking Results Trustworthiness within Mobile A/B Testing

Let’s suppose that MSQRD decided to check whether a changed **order of screenshots** (Variation B) produces a better conversion rate. Assume that we got the **following results** after sending 200 different users to each variation:

- Variation A – 40 converted users;
- Variation B – 57 converted users.

The observed difference in the variations’ performance is **statistically significant** at the confidence level of **95%**. The picture above shows the result of such validation performed using an online mobile A/B testing calculator.

Let’s imagine we didn’t finish the experiment on reaching the above-mentioned result and **continued driving traffic**. When each variation got 500 users, we got the following results:

- Variation A – 101 converted users;
- Variation B – 127 converted users.

In this case, the significance check will show that the observed performance difference **isn’t statistically significant** at the 95% confidence level.
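Both checks can be approximated in a few lines of Python. This is a minimal sketch that assumes a two-sided, pooled two-proportion z-test; the calculator mentioned above may use a different method, and the function name `two_proportion_z_test` is purely illustrative. Under these assumptions, the first snapshot gives a p-value a little under 0.05, while the second lands almost exactly on the 0.05 threshold, which shows how fragile such a verdict can be.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for a pooled two-proportion z-test."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis CR(A) == CR(B).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Snapshot 1: 200 users per variation, 40 vs 57 conversions.
z_200, p_200 = two_proportion_z_test(40, 200, 57, 200)
# Snapshot 2: 500 users per variation, 101 vs 127 conversions.
z_500, p_500 = two_proportion_z_test(101, 500, 127, 500)

print(f"200 per variation: z = {z_200:.3f}, p = {p_200:.4f}")
print(f"500 per variation: z = {z_500:.3f}, p = {p_500:.4f}")
```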

Is the example we examined realistic? Sure, it is.

For instance, if the true conversion rates of variations A and B are **20% and 26% respectively**, these values fall within the corresponding **confidence intervals** for both 200 and 500 visitors per variation.
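This claim is easy to verify. The sketch below assumes the simple normal-approximation (Wald) interval at 95% confidence; other interval types would give slightly different bounds. The helper name `wald_interval` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(conversions, visitors, confidence=0.95):
    """Normal-approximation confidence interval for a conversion rate."""
    rate = conversions / visitors
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(rate * (1 - rate) / visitors)
    return rate - margin, rate + margin


# (conversions, visitors, assumed true conversion rate)
snapshots = {
    "A, 200 visitors": (40, 200, 0.20),
    "B, 200 visitors": (57, 200, 0.26),
    "A, 500 visitors": (101, 500, 0.20),
    "B, 500 visitors": (127, 500, 0.26),
}

# For each snapshot, record whether the assumed true rate lies in the interval.
covered = {}
for label, (conv, n, true_rate) in snapshots.items():
    low, high = wald_interval(conv, n)
    covered[label] = low <= true_rate <= high
    print(f"{label}: [{low:.3f}, {high:.3f}] contains {true_rate:.2f}: {covered[label]}")
```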

According to this example, if we finished the experiment on reaching **200 visitors for each variation**, we could conclude that variation B performed better. However, if we finished the test after having 500 visitors on each product page variant, we could conclude that **both variations are interchangeable**. Pretty confusing, isn’t it?

It raises the legitimate question:

How many users do we need to run trustworthy mobile A/B tests?

Thus, we need to figure out what sample size is necessary for getting statistically significant results in the course of our mobile A/B testing.

## How to Calculate A/B Testing Sample Size

Now, let’s review how to calculate a sample size for A/B tests based on statistical hypothesis testing.

First, we need to understand what the **null hypothesis** really is. In mobile A/B testing, the null hypothesis is normally the assumption that the difference between the performances of variations A and B **equals zero**.

It has been theoretically proven that the sample size required for **acceptance/rejection** of the null hypothesis for a KPI expressed as a proportion (conversion rate in our case) depends on the following **5 parameters**:

- the conversion rate value of our control variation (variation A);
- the minimum difference between the values of variations A and B conversion rates which is to be identified;
- chosen confidence/significance level;
- chosen statistical power;
- type of the test: one- or two-tailed.

### Determining Sample Size for our MSQRD Mobile A/B Testing

Let’s clarify the above-mentioned parameters and **determine the sample size** for our MSQRD example:

- The conversion rate of variation A: 20% *(CR(A) = 0.2)*;
- The conversion rate of variation B: 26% *(CR(B) = 0.26)*.

Thus, the conversion rate value of our control variation A is 20% *(CR(A) = 0.2)*. Our example presumes that:

- the **minimum difference** between the conversion values of variations A and B is 6% in absolute terms;
- variation B performed **better** than variation A *(CR(B) = 0.26)*.

When determining sample size, some A/B testing calculators ask for the **minimum conversion rate difference** in relative rather than absolute terms. In our example, the minimum difference of **6% in absolute terms** corresponds to a **relative difference of 30%** (20% × 0.3 = 6%).
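The conversion between the two forms is plain arithmetic. A minimal sketch, with hypothetical helper names, using the MSQRD baseline of 20%:

```python
def relative_to_absolute(baseline_rate, relative_lift):
    """Absolute difference implied by a relative lift over the baseline."""
    return baseline_rate * relative_lift


def absolute_to_relative(baseline_rate, absolute_diff):
    """Relative lift implied by an absolute difference from the baseline."""
    return absolute_diff / baseline_rate


baseline = 0.20  # CR(A)
abs_diff = relative_to_absolute(baseline, 0.30)
rel_lift = absolute_to_relative(baseline, 0.06)

print(f"30% relative lift on a 20% baseline = {abs_diff:.2%} absolute")
print(f"6% absolute on a 20% baseline = {rel_lift:.0%} relative")
```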

As clarified in our post on mobile A/B testing results analysis, the confidence level and the significance level should add up to 100%. Let’s choose the **confidence level of 95%** and the significance level of 5% for our MSQRD example, as these are the values **most commonly used** in A/B tests.

### Type I Error

Let’s denote the significance level by **α**; the confidence level is then **1 − α**. Mind that the significance level is the probability of erroneously **rejecting the null hypothesis** (type I error). In the context of mobile A/B testing, it is the probability of concluding that the conversion rates of variations A and B differ when in fact they are equal.

### Type II Error

There is another error type which should be minimized as well: **failing to reject the null hypothesis when it is false** (type II error). When it comes to mobile A/B testing, it means concluding that the conversion rates of variations A and B are equal when they actually differ.

In statistics, the probability of type II error is usually denoted by **β**, and the value **1 − β** is called statistical power. In A/B tests, the **statistical power** is typically set to 80%.
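To make power more tangible, the sketch below estimates it for the MSQRD example at several sample sizes. It assumes a two-tailed two-proportion z-test at a 5% significance level, true rates of 20% and 26%, and the usual normal approximation (the negligible chance of rejecting in the wrong direction is ignored). The function name `approx_power` is illustrative. Under these assumptions, stopping at 200 or 500 visitors per variation leaves the test with roughly 30% and 60% power respectively, well short of the customary 80%.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(rate_a, rate_b, n_per_variation, alpha=0.05):
    """Approximate power of a two-tailed two-proportion z-test."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    mean_rate = (rate_a + rate_b) / 2
    # Standard deviation terms under the null and alternative hypotheses.
    sd_null = sqrt(2 * mean_rate * (1 - mean_rate))
    sd_alt = sqrt(rate_a * (1 - rate_a) + rate_b * (1 - rate_b))
    shift = abs(rate_b - rate_a) * sqrt(n_per_variation)
    return norm.cdf((shift - z_crit * sd_null) / sd_alt)


for n in (200, 500, 772):
    print(f"n = {n} per variation: power is about {approx_power(0.20, 0.26, n):.2f}")
```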

The choice of a one- or two-tailed testing depends on **what we want to check**:

- A **one-tailed test** is used if we want to check the significance of the observed positive difference in variations conversion rates (i.e. our goal is to replace variation A with variation B if the latter has a better conversion rate).
- A **two-tailed test** is used if we want to check whether *CR(B)* and *CR(A)* differ (i.e. we are interested in both positive and negative differences).

Here’s an A/B test sample size formula for **one-tailed tests**:

In the case of **two-tailed testing**, we use the following A/B testing sample size formula:
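A standard pair of formulas for comparing two proportions, which reproduces the per-variation figures of 608 and 772 quoted later for the MSQRD example, can be written as follows. This is a sketch under assumptions rather than a confirmed transcription: $p_A$ and $p_B$ are shorthand for *CR(A)* and *CR(B)*, $\bar{p} = (p_A + p_B)/2$ is their average, $Z_{\alpha}$ is read as the upper critical value for a significance level $\alpha$ (NORM.S.INV(1 − α) in Excel), and $Z_{1-\beta}$ as NORM.S.INV(1 − β).

$$
n_1 = \frac{\left( Z_{\alpha}\sqrt{2\bar{p}(1-\bar{p})} + Z_{1-\beta}\sqrt{p_A(1-p_A) + p_B(1-p_B)} \right)^2}{(p_B - p_A)^2}
$$

$$
n_2 = \frac{\left( Z_{\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + Z_{1-\beta}\sqrt{p_A(1-p_A) + p_B(1-p_B)} \right)^2}{(p_B - p_A)^2}
$$

Both results are rounded up to the next whole visitor.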

*n1* – the number of visitors for each of variations A and B in case of a one-tailed test;

*n2* – the number of visitors for each of variations A and B in case of a two-tailed test;

*Z* – the *standard score*, or *Z-score*.

The only difference between these two A/B testing sample size formulas is that **Z(α)** is used in the first one while the second uses **Z(α/2)**.

The values of Z(α), Z(α/2) and Z(1-β) can be calculated with the help of the **Excel function NORM.S.INV**:
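The same values can be obtained outside Excel. A minimal Python sketch using the standard library, where `NormalDist().inv_cdf` plays the role of NORM.S.INV; it assumes α is the 5% significance level and 1 − β is the 80% power chosen for the MSQRD example:

```python
from statistics import NormalDist

alpha = 0.05  # significance level (assumed, as chosen for the MSQRD example)
beta = 0.20   # type II error probability, i.e. 80% power

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

z_alpha = std_normal.inv_cdf(1 - alpha)           # one-tailed critical value
z_alpha_half = std_normal.inv_cdf(1 - alpha / 2)  # two-tailed critical value
z_power = std_normal.inv_cdf(1 - beta)            # value tied to the power

print(f"Z(alpha)   = {z_alpha:.3f}")
print(f"Z(alpha/2) = {z_alpha_half:.3f}")
print(f"Z(1-beta)  = {z_power:.3f}")
```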

Making the calculations for the MSQRD example mentioned above and **rounding the results**, we get the following values:

These calculations can be made with the help of the free-to-use software Gpower.

Therefore, if *CR(A)* is 20% and the estimated *CR(B)* value is at least 26%, we’ll have to run our experiment until **each variation gets 608 different visitors** to check the statistical significance at the significance level of 5% and with 80% statistical power. Thus, the **total number of the experiment visitors** should be 1216.

In case we are interested in **both positive and negative conversion rate differences**, the results will be slightly different.

If *CR(A)* is 20% and we need to find a 6% difference in absolute terms, we’ll have to fill each variation with 772 different visitors to check the statistical significance at the significance level of 5% and with 80% statistical power. Therefore, **the total number of visitors within this mobile A/B testing** should be 1544.
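Both figures can be reproduced in a few lines of Python. The sketch below assumes the standard two-proportion sample size formula with a pooled term under the null hypothesis, a 5% significance level, and 80% power. The function name `sample_size_per_variation` and its parameters are illustrative and not taken from any calculator mentioned here.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variation(rate_a, rate_b, alpha=0.05, power=0.80,
                              two_tailed=False):
    """Visitors needed per variation to detect rate_a vs rate_b."""
    norm = NormalDist()
    # A two-tailed test splits the significance level between both tails.
    tail_alpha = alpha / 2 if two_tailed else alpha
    z_alpha = norm.inv_cdf(1 - tail_alpha)
    z_power = norm.inv_cdf(power)
    mean_rate = (rate_a + rate_b) / 2
    numerator = (
        z_alpha * sqrt(2 * mean_rate * (1 - mean_rate))
        + z_power * sqrt(rate_a * (1 - rate_a) + rate_b * (1 - rate_b))
    ) ** 2
    # Round up: a fraction of a visitor still requires a whole visitor.
    return ceil(numerator / (rate_b - rate_a) ** 2)


n_one_tailed = sample_size_per_variation(0.20, 0.26)
n_two_tailed = sample_size_per_variation(0.20, 0.26, two_tailed=True)

print(f"One-tailed: {n_one_tailed} per variation, {2 * n_one_tailed} in total")
print(f"Two-tailed: {n_two_tailed} per variation, {2 * n_two_tailed} in total")
```

Because the required size grows with the inverse square of the difference, halving the minimum detectable difference roughly quadruples the traffic needed.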

Thus, if *n* is the sample size calculated according to the method described above, to compare variation B with control variation A, we’ll need *n* **visitors** for variation A and *n* **visitors** for variation B (the total is *2n* **visitors**).

But what should we do if the experiment we run has 3 variations?