The aim of product page A/B testing is to check whether a **modified version** of an app page element performs better than the control variation in terms of a **certain KPI**. In the context of app store page A/B testing, **conversion** is the core KPI most of the time.

However, we all know that it’s **not enough** to create an experiment with 2 variations, fill it with a dozen users and expect distinctive, **trustworthy results**. What, then, turns a split-test into an A/B test you can trust?

The main characteristic that defines a successful A/B experiment is high **statistical significance**, which means you’ll actually get the **conversion increase** the test promised once you upload the winning variation to the store.

In this post, we’ll deconstruct **each step of the calculations** behind checking the statistical significance of A/B test results, using the following example.

Let’s say the **Prisma team** wanted to run an A/B test to find out whether their new set of screenshots **favors a conversion rate increase**. The existing screenshots were used in **variation A**, which became the control one. The updated screenshot set was used in **variation B**.

Let’s imagine that each variation was visited by **14,500 different users**:

- Variation A: **1,450** page visitors installed the app;
- Variation B: **1,600** page visitors installed the app.

What **conclusions** can be drawn from this test?

The A/B testing calculator will help us in our **results analysis**. As can be seen in the picture below, the confidence level we chose amounts to **95%**. This is the parameter value usually used in split-tests. You may also come across **90%** and **99%** **confidence levels**; other values are quite rare.

As you can see, the calculator provided the **confidence interval** of the conversion rate for each variation. This metric reflects the **probability** that the app store conversion rate will fall between two set values.

This estimation leads us to the conclusion that variation **B with its 10.5%–11.6%** conversion interval performed better than control variation **A with its 9.5%–10.5%** interval.

**Why Do We Need a Confidence Interval?**

The **point estimate** of conversion is the ratio of the number of converted users to the total number of users that visited the page. So **let’s calculate** the point estimate values for variations A and B of our example.

- CR(A) = 1450 / 14500 = 0.1;
- CR(B) = 1600 / 14500 = 0.11.

Therefore, we see once again that variation **B triggers a 0.01 higher conversion** rate in comparison with the control variation A. It means that the conversion rate of variation B is **10% better** than that of variation A (0.01 / 0.1 = 0.1).
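These point estimates and the relative lift can be reproduced with a few lines of Python (numbers taken from the example above; note that CR(B) is 0.1103 before the article’s rounding to 0.11):

```python
# Point estimates of conversion for the Prisma example.
visitors = 14_500
installs_a, installs_b = 1_450, 1_600

cr_a = installs_a / visitors          # 0.10
cr_b = installs_b / visitors          # ~0.1103, rounded to 0.11 in the text
absolute_lift = cr_b - cr_a           # ~0.01
relative_lift = absolute_lift / cr_a  # ~0.10, i.e. a ~10% relative improvement

print(f"CR(A)={cr_a:.4f}, CR(B)={cr_b:.4f}, relative lift={relative_lift:.1%}")
```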

Mind that the conversion rates we got are not exact: they are **estimated values**, i.e. the result of testing the product page on two groups of users randomly chosen from the statistical population. Indeed, it’s quite unlikely that anyone is capable of running a test on **every single app store user** that meets your targeting.

However, why can we use **estimated values** instead of exact ones?

Let’s consider an example to answer this question. Imagine that a company with **20,000 employees** decided to test the page of their internal web service. Suppose all employees participated in the experiment and **1,980** of them took the desired action. Therefore, we can calculate the **exact conversion rate**:

1980 / 20000 = 9.9%

Yet, in practice it’s quite problematic to run a **test on all employees**, let alone track the behaviour of all potential app users.

That’s why we normally run tests on a *statistical population sample* of randomly chosen users. The number of users in such a sample is referred to as the **sample size** in statistics.

Going back to our example, if we run a test on **500 randomly chosen employees** of the company, it’s **impossible** to get a 9.9% conversion rate, as 500 × 9.9% = 49.5 and the number of people can’t be fractional.

Thus, we have to allow for a certain **margin of error** when representing the whole population with a sample.

Let’s imagine that the company ran **2 tests** with 500 employees each:

- Test 1: 49 people took the desired action, so the *point estimate* is **9.8%**;
- Test 2: 50 people took the desired action, so the *point estimate* is **10%**.

These numbers are different, and they don’t coincide with the **exact conversion rate** (9.9%) either. Nevertheless, if we set a **±0.2% interval** around each estimate, we’ll get interval estimates of **9.6%–10%** and **9.8%–10.2%**, and the exact rate falls within both ranges.
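A quick Python sketch of this check (sample outcomes taken from the two hypothetical tests above):

```python
exact_cr = 1980 / 20000  # the "exact" conversion rate: 9.9%

# Two hypothetical samples of 500 employees each, as in the text.
results = []
for converted, n in [(49, 500), (50, 500)]:
    estimate = converted / n
    low, high = estimate - 0.002, estimate + 0.002  # ±0.2% interval
    results.append((estimate, low <= exact_cr <= high))

for estimate, contains in results:
    print(f"point estimate {estimate:.1%}, interval contains 9.9%: {contains}")
```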

**How to Calculate a Confidence Interval?**

Let’s examine the formula behind the confidence intervals the calculator provided for Prisma’s experiment:

CI = CR ± Zα × √(CR × (1 − CR) / n)

CR – *point estimate* of the conversion rate;

n – *sample size* (the 14,500 users that visited the corresponding product page variation);

Zα – the *coefficient corresponding to confidence level α* (in statistical terms, the Z-score of the standard normal distribution).

Zα can be calculated with the Excel NORM.S.INV function. Here is a table containing **rounded** Zα values for the most widely used confidence levels:

| Confidence level α | Zα |
| --- | --- |
| 90% | 1.645 |
| 95% | 1.96 |
| 99% | 2.576 |

The **NORM.S.INV argument** is calculated with the following formula, in which α stands for the confidence level:

(1 + α) / 2

Let’s do the calculation for the **95%** confidence level:

NORM.S.INV((1 + 0.95) / 2) = NORM.S.INV(0.975) ≈ 1.96
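If you don’t have Excel at hand, the same calculation can be sketched with Python’s standard library, where `NormalDist().inv_cdf` plays the role of NORM.S.INV:

```python
from statistics import NormalDist

def z_for_confidence(alpha: float) -> float:
    """Z-score for a given confidence level, i.e. NORM.S.INV((1 + alpha) / 2)."""
    return NormalDist().inv_cdf((1 + alpha) / 2)

# The three most widely used confidence levels from the table above.
for alpha in (0.90, 0.95, 0.99):
    print(f"{alpha:.0%}: Z = {z_for_confidence(alpha):.3f}")
```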

Now we can calculate the **confidence interval** for Prisma’s variation A (α = 95%, therefore, Zα = 1.96):

0.1 ± 1.96 × √(0.1 × (1 − 0.1) / 14500) ≈ 0.1 ± 0.0049

Thus, the confidence interval for Prisma’s control variation A can be represented as **10% ± 0.5%** or **9.5% – 10.5%**.
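Putting the formula together for both variations gives a minimal Python sketch (standard library only; the small differences from the calculator’s 11.6% upper bound for variation B come down to rounding):

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(conversions: int, visitors: int, alpha: float = 0.95):
    """Confidence interval for a conversion rate: CR ± Z * sqrt(CR(1-CR)/n)."""
    cr = conversions / visitors
    z = NormalDist().inv_cdf((1 + alpha) / 2)
    margin = z * sqrt(cr * (1 - cr) / visitors)
    return cr - margin, cr + margin

lo_a, hi_a = confidence_interval(1450, 14500)  # ~9.5% – 10.5%
lo_b, hi_b = confidence_interval(1600, 14500)  # ~10.5% – 11.5%
print(f"A: {lo_a:.1%} – {hi_a:.1%}")
print(f"B: {lo_b:.1%} – {hi_b:.1%}")
```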

**How to Interpret Confidence Interval?**

Now, we’ll try to **interpret the result**. Let’s assume that we know the **exact conversion rate** of variation A, i.e. the CR calculated over **all potential visitors** of the variation A page.

If we decide to test the conversion of this product page running the same experiment on different user groups from the same statistical population, we can calculate confidence intervals for each of these groups using **95%** confidence level.

It means that the **exact conversion rate will be within confidence interval margins** in 95% of cases.

**Testing Statistical Significance of the Conversion Rate Difference**

We’ve already come to the conclusion that variation B is **better than the control** one, as CR(B) is greater than CR(A). However, before uploading the screenshot set from variation B to the store, it’s necessary to ensure that the difference in the variations’ performance is **statistically significant**.

Such statistical significance testing is always **bound to the confidence level** we choose (95% in our case). If we enter the data from the Prisma experiment into the A/B Tests Calculator, we’ll come across the following **conclusion**:

«You can be **95%** confident that this result is a consequence of the changes you made and not a result of random chance.»

Some **analytical tools** formulate the same conclusion in the following way: «Chance to beat original is **95%**» (in our example, A is the original).

**Mathematical tools** for statistical hypothesis testing are commonly used to confirm the statistical significance of the difference in variations’ performance. The following **3-step algorithm** facilitates the hypothesis testing process:

- generate *two hypotheses* (null and alternative);
- *calculate the p-value* – the probability of observing a result at least as extreme as ours if the null hypothesis is true (you can notice that the calculator we used at the beginning provided this metric; however, mind that there are calculators that indicate a *point estimate* instead of an interval one, as well as other statistical parameters necessary for *p-value* calculation, such as the *Z-score*);
- *compare the p-value* with the significance level referred to as 1 − α, where α is the confidence level.

If the *p-value* is lower than 1 − α, the alternative hypothesis is accepted with confidence level α. Otherwise, the *null hypothesis* is accepted.

Now, let’s check the **statistical significance** of the conclusion that the conversion rate of variation B is greater than that of the control variation A using the algorithm mentioned above.

**Step 1**

Our **null hypothesis** can be formulated as CR(B) – CR(A) = 0, which means the conversions of the variations have no difference. Therefore, our *alternative hypothesis* will state that the conversion of variation B is greater than that of variation A: CR(B) > CR(A).

**Step 2**

The calculators we used showed that the alternative hypothesis should be **accepted with a 95%** confidence level, which means we’ll work with a significance level of 5%.

First of all, it’s necessary to compute the **standard errors (SE) of the conversion rates** for variations A – SE(A) – and B – SE(B) – with the following formula:

SE = √(CR × (1 − CR) / n)

CR – the conversion rate estimate;

n – the number of users that visited the variation.

Then, we have to calculate the *standard error* of the difference between the variations’ conversion rates:

SE(difference) = √(SE(A)² + SE(B)²)

The third step presupposes **Z-score** calculation using this formula:

Z = (CR(B) − CR(A)) / SE(difference)

Applying all these formulas and rounding the results, we’ll get:

- **SE(A)** = 0.002491;
- **SE(B)** = 0.002602;
- **SE(difference)** = 0.003602;
- **Z** = 2.8717.
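These figures can be reproduced with a short standard-library Python sketch (all inputs are the numbers from the Prisma example):

```python
from math import sqrt

n = 14_500
cr_a, cr_b = 1450 / n, 1600 / n

# Standard error of each conversion rate: sqrt(CR * (1 - CR) / n).
se_a = sqrt(cr_a * (1 - cr_a) / n)   # ~0.002491
se_b = sqrt(cr_b * (1 - cr_b) / n)   # ~0.002602

# Standard error of the difference and the resulting Z-score.
se_diff = sqrt(se_a**2 + se_b**2)    # ~0.003602
z = (cr_b - cr_a) / se_diff          # ~2.8717
print(f"SE(A)={se_a:.6f}, SE(B)={se_b:.6f}, SE(diff)={se_diff:.6f}, Z={z:.4f}")
```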

**Step 3**

Now it’s time to calculate the **p-value**. Since **Z is a positive figure**, the p-value should be calculated as the area under the standard normal distribution curve to the right of point Z. This area is shown by hatching in the **image below**.

To compute the **p-value**, we use the Excel formula **1-NORM.S.DIST(Z; TRUE)**. Thus, the rounded *p-value* **= 0.002**. Then we should compare the *p-value* **with the significance level of 5%**.

**0.002 < 0.05** – therefore, we proved that Prisma’s optimized variation B is better than variation A.
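The same one-tailed p-value can be computed in Python, where `NormalDist().cdf` is the standard-library equivalent of NORM.S.DIST:

```python
from statistics import NormalDist

z = 2.8717  # the Z-score computed for the Prisma experiment

# One-tailed p-value: the area under the standard normal curve
# to the right of Z, i.e. the Excel formula 1 - NORM.S.DIST(Z, TRUE).
p_one_tailed = 1 - NormalDist().cdf(z)
print(f"p-value = {p_one_tailed:.3f}")  # rounds to 0.002, well below 0.05
```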

You may justly wonder why the **p-value** we just computed is two times smaller than the one provided by the **calculator** we used at the beginning.

First of all, well done for being so attentive. Secondly, this **mismatch can be explained** by the fact that in some cases a one-tailed **p-value** is calculated, while in others a two-tailed one is computed.

It all depends on **what** we actually test:

- If we check how significant the observed **positive difference** is, a one-tailed p-value is calculated (just as in the Prisma case we’ve been examining today);
- If we are interested in **both positive and negative differences**, a two-tailed p-value is computed. It amounts to the total of the areas shown by hatching in the image below, which leads to the **doubling of the one-tailed p-value**.
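The doubling is easy to verify in code (Z taken from the example above):

```python
from statistics import NormalDist

z = 2.8717  # Z-score from the Prisma experiment

p_one_tailed = 1 - NormalDist().cdf(z)  # ~0.002: right tail only
p_two_tailed = 2 * p_one_tailed         # ~0.004: both tails, by symmetry

print(f"one-tailed: {p_one_tailed:.3f}, two-tailed: {p_two_tailed:.3f}")
```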

**Afterthought**

Today we deconstructed the process of **results analysis** using the example of a screenshot A/B test to showcase the **statistical principles** behind SplitMetrics experiments.

Remember that a **successful split-test** can’t be launched or interpreted mindlessly. That’s why it’s so important to use **trustworthy specialized platforms like SplitMetrics** for product page A/B testing.