A/B Tests Results Analysis: Statistical Significance, Confidence Level and Intervals

statistical principles of SplitMetrics A/B tests

The aim of product page A/B testing is to check whether a modified version of an app page element performs better than the control variation in terms of a certain KPI. In the context of app store page A/B testing, conversion is the core KPI most of the time.

However, we all know that it’s not enough to create an experiment with 2 variations, fill it with a dozen users and expect distinctive and trustworthy results. What turns a split-test into an A/B test you can trust then?

The main characteristic of a successful A/B experiment is high statistical significance: it indicates that you are likely to actually get the conversion increase the test promised once you upload the winning variation to the store.

In this post, we’ll deconstruct each step of the calculations behind checking the statistical significance of A/B test results, using the following example.

SplitMetrics A/B Tests Results Analysis

 

Let’s say the Prisma team wanted to run an A/B test to find out whether their new set of screenshots increases the conversion rate. The existing screenshots were used in variation A, which became the control. The updated set of screenshots was used in variation B.

Let’s imagine that each variation was visited by 14 500 different users:

  • Variation A: 1450 page visitors installed the app;
  • Variation B: 1600 page visitors installed the app.

What conclusions can be drawn from this test?

The A/B testing calculator will help us analyze the results. As can be seen in the picture below, the confidence level we chose is 95%. This is the value usually used in split-tests. You can also come across 90% and 99% confidence levels; other values are quite rare.

A/B testing conversion interval analysis

As you can see, the calculator provided the confidence interval of the conversion rate for each variation. A confidence interval is a range of values that, at the chosen confidence level, is expected to contain the true app store conversion rate.

These estimates suggest that variation B, with a 10.5% – 11.6% conversion interval, performed better than control variation A, with a 9.5% – 10.5% interval.

Why do We Need Confidence Interval?

The point estimate of conversion is the ratio of the number of converted users to the total number of users that visited the page. So let’s calculate the point estimates for variations A and B of our example.

  • CR(A) = 1450 / 14500 = 0.1;
  • CR(B) = 1600 / 14500 ≈ 0.11.

Therefore, we see once again that variation B shows a conversion rate that is about 0.01 (1 percentage point) higher than that of control variation A. In relative terms, the conversion of variation B is about 10% better than the conversion rate of variation A (0.01 / 0.1 = 0.1).
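The same arithmetic can be reproduced in a few lines of Python. This is only a minimal sketch: the variable names are illustrative and are not part of any SplitMetrics tooling.

```python
# Point estimates for the Prisma example (figures taken from the text above).
visitors_per_variation = 14_500
installs_a = 1_450  # control variation A
installs_b = 1_600  # variation B

cr_a = installs_a / visitors_per_variation
cr_b = installs_b / visitors_per_variation

absolute_uplift = cr_b - cr_a             # difference between the two rates
relative_uplift = absolute_uplift / cr_a  # improvement relative to the control

print(f"CR(A) = {cr_a:.4f}")
print(f"CR(B) = {cr_b:.4f}")
print(f"absolute uplift = {absolute_uplift:.4f}")
print(f"relative uplift = {relative_uplift:.1%}")
```

If CR(B) is not rounded to 0.11 first, the relative uplift comes out slightly above 10%.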

confidence interval and level in SplitMetrics

Mind that the conversion rates we got are not exact. They are estimates, because they result from product page testing on two groups of users randomly chosen from the statistical population. Indeed, it’s quite unlikely that anyone could run a test on every single app store user that meets your targeting.

However, why can we use estimated values instead of exact ones?

Let’s consider an example to answer this question. Imagine that a company with 20 000 employees decided to test a page of its internal web service. Suppose all employees participated in the experiment and 1980 of them took the desired action. Therefore, we can calculate the exact conversion rate:

1980 / 20000 = 9.9%

Yet, it’s quite problematic to run a test on all employees in practice, let alone to track the behaviour of all potential app users.

That’s why we normally run tests on a sample of users randomly chosen from the statistical population.

The number of users in such a sample is referred to as the sample size in statistics.

Going back to our example, if we run a test on 500 randomly chosen employees of the company, it’s impossible to get a 9.9% conversion rate, as 500 * 9.9% = 49.5 and the number of people can’t be fractional.

Thus, we have to allow for a certain margin of error when we use a sample to represent the whole population.

Let’s imagine that the company ran 2 tests with 500 employees each:

  • Test 1: 49 people took the desired action so point estimate is 9.8%;
  • Test 2: 50 people took the desired action so point estimate is 10%.

These numbers are different and they don’t coincide with the exact conversion rate (9.9%) either. Nevertheless, if we allow a ± 0.2% margin around each point estimate, we get interval estimates of 9.6% – 10.0% for Test 1 and 9.8% – 10.2% for Test 2, and the exact conversion rate falls within both ranges.
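To see how point estimates move from one sample to another, the company example can be simulated. The sketch below is an illustration under stated assumptions: the population is modelled as a list of 0/1 flags, and the seed and the number of repeated tests (5) are arbitrary choices rather than figures from the example.

```python
import random

POPULATION_SIZE = 20_000
CONVERTED = 1_980
SAMPLE_SIZE = 500

# 1 = employee took the desired action, 0 = did not.
population = [1] * CONVERTED + [0] * (POPULATION_SIZE - CONVERTED)
exact_cr = CONVERTED / POPULATION_SIZE

rng = random.Random(7)  # fixed seed so the run is repeatable (arbitrary choice)


def sample_point_estimate(sample_size: int) -> float:
    """Conversion rate observed in one random sample drawn without replacement."""
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size


point_estimates = [sample_point_estimate(SAMPLE_SIZE) for _ in range(5)]

print(f"exact CR: {exact_cr:.1%}")
for i, estimate in enumerate(point_estimates, start=1):
    print(f"test {i}: {estimate:.1%} (off by {estimate - exact_cr:+.1%})")
```

Every estimate is a whole number of people divided by 500, so none of them can be exactly 9.9%, which is why an interval estimate is needed.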

How to Calculate Confidence Interval?

Let’s examine the formula the calculator used to build the confidence intervals for Prisma’s experiment:

$$CI = CR \pm Z_{\alpha} \cdot \sqrt{\frac{CR \cdot (1 - CR)}{n}}$$

where:

CR is the point estimate of the conversion rate;

n is the sample size (the 14 500 users that visited the corresponding product page variation);

Zα is the coefficient corresponding to confidence level α (in statistical terms, it’s a Z-score of the standard normal distribution).

Zα can be calculated with the Excel NORM.S.INV function. Here are the rounded Zα values for the most widely used confidence levels:

  • 90% confidence level: Zα ≈ 1.645;
  • 95% confidence level: Zα ≈ 1.96;
  • 99% confidence level: Zα ≈ 2.576.

The NORM.S.INV argument is calculated with the following formula, in which α stands for the confidence level:

$$\text{argument} = 1 - \frac{1 - \alpha}{2} = \frac{1 + \alpha}{2}$$

Let’s do the calculations for the 95% confidence level:

$$Z_{0.95} = \text{NORM.S.INV}\left(\frac{1 + 0.95}{2}\right) = \text{NORM.S.INV}(0.975) \approx 1.96$$
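Outside Excel, the same coefficient can be obtained from the inverse CDF of the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `z_for_confidence` is made up for this illustration.

```python
from statistics import NormalDist


def z_for_confidence(confidence_level: float) -> float:
    """Two-sided Z coefficient for a confidence level such as 0.95."""
    if not 0 < confidence_level < 1:
        raise ValueError("confidence_level must be between 0 and 1")
    # Same argument as for NORM.S.INV: (1 + confidence level) / 2.
    return NormalDist().inv_cdf((1 + confidence_level) / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: Z = {z_for_confidence(level):.3f}")
```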

Now we can calculate the confidence interval for Prisma’s variation A (α = 95%, therefore, Zα = 1.96).

$$CI(A) = 0.1 \pm 1.96 \cdot \sqrt{\frac{0.1 \cdot (1 - 0.1)}{14500}} \approx 0.1 \pm 0.005$$

Thus, the confidence interval for Prisma’s control variation A can be represented as 10% ± 0.5% or 9.5% – 10.5%.
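The calculation for variation A can be sketched in Python as follows. It assumes the normal-approximation formula described in this section; the function name `conversion_confidence_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def conversion_confidence_interval(conversions, visitors, confidence_level=0.95):
    """Normal-approximation confidence interval for a conversion rate."""
    cr = conversions / visitors
    z = NormalDist().inv_cdf((1 + confidence_level) / 2)
    margin = z * sqrt(cr * (1 - cr) / visitors)
    return cr - margin, cr + margin


# Prisma's control variation A: 1450 installs out of 14 500 visitors.
low_a, high_a = conversion_confidence_interval(1_450, 14_500)
print(f"Variation A: {low_a:.1%} to {high_a:.1%}")
```

Calculators may rely on a slightly different interval method, so their last digit can differ from this approximation.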

How to Interpret Confidence Interval?

Now, we’ll try to interpret the result. Let’s assume that there is an exact conversion rate of variation A, meaning the CR we would get if all potential visitors of the variation A page took part in the test.

If we test the conversion of this product page by running the same experiment on different user groups from the same statistical population, we can calculate a confidence interval for each of these groups using the 95% confidence level.

In about 95% of these cases, the exact conversion rate will be within the margins of the calculated confidence interval.
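This interpretation can be checked with a small simulation. The sketch below repeats an experiment many times and counts how often the interval contains the true rate. All constants in it are assumptions made for illustration: the "exact" conversion rate of 10%, the sample size, the number of repetitions and the seed.

```python
import random
from math import sqrt
from statistics import NormalDist

TRUE_CR = 0.10           # hypothetical "exact" conversion rate of the population
SAMPLE_SIZE = 1_000      # visitors per simulated experiment (assumption)
EXPERIMENTS = 1_000      # number of repeated experiments (assumption)
CONFIDENCE_LEVEL = 0.95

rng = random.Random(2024)  # fixed seed, arbitrary choice
z = NormalDist().inv_cdf((1 + CONFIDENCE_LEVEL) / 2)

covered = 0
for _ in range(EXPERIMENTS):
    # Each simulated visitor converts with probability TRUE_CR.
    conversions = sum(rng.random() < TRUE_CR for _ in range(SAMPLE_SIZE))
    cr = conversions / SAMPLE_SIZE
    margin = z * sqrt(cr * (1 - cr) / SAMPLE_SIZE)
    if cr - margin <= TRUE_CR <= cr + margin:
        covered += 1

coverage = covered / EXPERIMENTS
print(f"Intervals containing the true rate: {coverage:.1%}")
```

The printed share should land close to the chosen confidence level, though it will not match it exactly for a finite number of experiments.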

Testing Statistical Significance of Conversion Rates Differences

We’ve already seen that variation B looks better than the control one, as CR(B) is greater than CR(A). However, before uploading the screenshot set from variation B to the store, it’s necessary to ensure that the difference in the variations’ performance is statistically significant.

Such statistical significance testing is always bound to the confidence level we choose (95% in our case). If we enter the data from Prisma experiment to the A/B Tests Calculator, we’ll come across the following conclusion:

You can be 95% confident that this result is a consequence of the changes you made and not a result of random chance.

calculators for interpreting test results

Some analytical tools formulate the same conclusion in the following way: «Chance to beat original is 95%» (in our example, A is an original).

Statistical hypothesis testing is commonly used to confirm the statistical significance of the difference in variations’ performance. The following 3-step algorithm facilitates the hypothesis testing process:

  1. generate two hypotheses (null and alternative);
  2. calculate the p-value: the probability of observing a difference at least as large as the one in the test, assuming the null hypothesis is true (the calculator we used at the beginning provides this metric; mind, however, that some calculators show a point estimate instead of an interval one, along with other statistics used for the p-value calculation, such as the Z-score);
  3. compare the p-value with the significance level, defined as 1-α, where α is the confidence level.

If the p-value is lower than 1-α, the null hypothesis is rejected and the alternative hypothesis is accepted with confidence level α. Otherwise, the null hypothesis can’t be rejected.

Now, let’s check the statistical significance of the conclusion that conversion rate of variation B is greater than the one of the control variation A using the algorithm mentioned above.

Step 1

Our null hypothesis can be formulated as CR(B) – CR(A) = 0 which means the conversions of variations have no difference. Therefore, our alternative hypothesis will state that the conversion of variation B is greater than the one of the variation A: CR(B) > CR(A).

Step 2

The calculators we used work with the 95% confidence level, so we have to check whether the alternative hypothesis can be accepted at this level. It means we’ll use a significance level of 5%.

First of all, it’s necessary to compute the standard errors (SE) of the conversion rates for variations A and B, SE(A) and SE(B), with the following formula:

$$SE = \sqrt{\frac{CR \cdot (1 - CR)}{n}}$$

CR – the conversion rate estimate;

n – the number of users that visited the variation.

Then, we have to calculate the standard error of the difference between the conversion rates:

$$SE(\text{difference}) = \sqrt{SE(A)^2 + SE(B)^2}$$

Finally, we calculate the Z-score using this formula:

$$Z = \frac{CR(B) - CR(A)}{SE(\text{difference})}$$

Applying all these formulas and rounding the results, we’ll get:

  • SE(A)=0.002491,
  • SE(B)=0.002602,
  • SE(difference)=0.003602,
  • Z=2.8717.
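These figures can be reproduced with the sketch below, which assumes the unpooled standard error described in this step (each variation’s SE computed from its own conversion rate). The helper name `standard_error` is illustrative.

```python
from math import sqrt

visitors = 14_500
cr_a = 1_450 / visitors  # control variation A
cr_b = 1_600 / visitors  # variation B


def standard_error(cr: float, n: int) -> float:
    """Standard error of a conversion rate estimated from n visitors."""
    return sqrt(cr * (1 - cr) / n)


se_a = standard_error(cr_a, visitors)
se_b = standard_error(cr_b, visitors)
se_diff = sqrt(se_a ** 2 + se_b ** 2)  # SE of the difference CR(B) - CR(A)
z_score = (cr_b - cr_a) / se_diff

print(f"SE(A) = {se_a:.6f}")
print(f"SE(B) = {se_b:.6f}")
print(f"SE(difference) = {se_diff:.6f}")
print(f"Z = {z_score:.4f}")
```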

Step 3

Now it’s time to calculate the p-value. Provided that Z is positive, the p-value is the area under the standard normal distribution curve to the right of point Z. This area is shown by hatching in the image below.

standard normal distribution in statistics

To compute the p-value, we use Excel formula 1-NORM.S.DIST(Z; TRUE). Thus, the rounded p-value = 0.002. Then we should compare p-value with the significance level of 5%.

0.002 < 0.05, so we reject the null hypothesis and conclude, at the 95% confidence level, that Prisma’s optimized variation B is better than variation A.
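The Excel step and the comparison with the significance level can be written in Python roughly as shown below. This is a sketch that takes the rounded Z-score from Step 2 as its input; the variable names are illustrative.

```python
from statistics import NormalDist

z_score = 2.8717           # rounded Z-score from Step 2
significance_level = 0.05  # 1 - confidence level of 95%

# Area under the standard normal curve to the right of Z,
# the equivalent of 1-NORM.S.DIST(Z; TRUE) in Excel.
p_value_one_tailed = 1 - NormalDist().cdf(z_score)
reject_null = p_value_one_tailed < significance_level

print(f"one-tailed p-value = {p_value_one_tailed:.4f}")
print("reject the null hypothesis" if reject_null else "cannot reject the null hypothesis")
```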

You may justly wonder why the p-value we just computed is two times smaller than the p-value provided by the calculator we used at the beginning:

statistical p-value calculator

First of all, well done for being so attentive. Secondly, this mismatch is explained by the fact that some calculators compute a one-tailed p-value, while others compute a two-tailed one.

It all depends on what we actually test:

  • If we check how significant the observed positive difference is, a one-tailed p-value is calculated (just as in the Prisma case we’ve been examining today);
  • If we are interested in both positive and negative differences, a two-tailed p-value is computed. It equals the sum of the two areas shown by hatching in the image below, which is double the one-tailed p-value.

two-tailed p-value
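The relation between the two variants can be illustrated with a short sketch. The function name `p_values` is hypothetical, and the one-tailed value is computed for the direction of the observed difference.

```python
from statistics import NormalDist


def p_values(z_score: float) -> tuple[float, float]:
    """Return (one-tailed, two-tailed) p-values for a Z-score."""
    # One tail: area beyond |Z| on the side of the observed difference.
    one_tailed = 1 - NormalDist().cdf(abs(z_score))
    # Two tails: the same area counted on both sides of the distribution.
    two_tailed = 2 * one_tailed
    return one_tailed, two_tailed


one_tailed, two_tailed = p_values(2.8717)  # rounded Z-score from the Prisma example
print(f"one-tailed p-value = {one_tailed:.4f}")
print(f"two-tailed p-value = {two_tailed:.4f}")
```

When comparing results from different calculators, it helps to check which of the two variants each one reports.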

Afterthought

Today we deconstructed the process of results analysis, using the example of a screenshots A/B test to showcase the statistical principles behind SplitMetrics experiments.

Remember that a successful split-test can’t be launched or interpreted mindlessly. That’s why it’s so important to use trustworthy specialized platforms like SplitMetrics for product page A/B testing.
