A/A testing is the tactic of using a testing tool to test two identical variations against each other. Whether A/A testing is worth conducting and, if so, for what purposes are questions that invite conflicting opinions.
In this post, we explore why some users of testing tools like SplitMetrics run A/A tests and what they need to keep in mind when performing this sort of test.
The main reasons for A/A testing are:
- checking the accuracy of A/B or Multi-Armed Bandit (MAB) testing tools;
- determining a baseline conversion rate for a control variation before beginning an A/B test.
Using A/A Test for Checking Accuracy of A/B Testing Tool
As a rule, users run an A/A test to check the accuracy of an A/B testing tool. This normally happens when they are considering a new A/B testing tool and want proof that it operates correctly.
When running such a test, keep in mind that an A/A test is an atypical scenario for an A/B testing tool, and make sure you understand beforehand what to expect from an A/B testing platform in such a scenario.
In an A/B test we determine sample size and run the experiment until each variation gets a determined number of visitors. When using the A/B testing tool to test identical variations, one should do the same. As we described in the post Determining Sample Size in A/B Testing, to calculate sample size for a trustworthy A/B test we need 5 parameters:
1) the conversion rate value of a control variation (variation A);
2) the minimum difference between variations A and B conversion rates which is to be identified;
3) chosen significance level;
4) chosen statistical power;
5) type of the test: one- or two-tailed test.
For an A/A test, parameters 1, 3 and 4 are set in the same way as for A/B tests. The second parameter, the minimum difference between the conversion rates of variations A and B that is to be identified, can be set as a small percentage of the conversion rate of variation A.
We recommend setting the fifth parameter to a two-tailed test, since both positive and negative differences in conversion rate should be considered.
What is A/A Testing: Example
Let’s consider an example. Suppose the conversion rate of variation A is 20% (CR(A) = 0.2). Assume that a resulting difference of less than 5% of this value, i.e. less than 0.01 (0.2 * 0.05 = 0.01), indicates that the tested variations are identical.
One can calculate the sample size for an A/A test with a significance level of 5%, a statistical power of 80% and a two-tailed test with the help of the free software G*Power.
Thus, one will have to run the A/A test until each variation gets 25 583 different visitors (51 166 unique visitors in total). As can be seen, the required sample size is large, which makes the test extremely resource-consuming.
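For readers who want to check the figure above without a dedicated tool, here is a minimal Python sketch. It assumes the common normal-approximation formula for comparing two proportions (pooled variance under the null hypothesis); the exact method used by the software mentioned above may differ, and the function name sample_size_two_proportions is ours, chosen for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80, two_tailed=True):
    """Visitors needed per variation to detect p1 vs p2 (normal approximation)."""
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_tailed else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)
    p_bar = (p1 + p2) / 2  # pooled rate under the null hypothesis
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


cr_a = 0.20             # conversion rate of variation A
min_diff = 0.05 * cr_a  # 5% of CR(A), i.e. 0.01
n_per_variation = sample_size_two_proportions(cr_a, cr_a + min_diff)
print(n_per_variation, 2 * n_per_variation)
```

With these inputs the sketch gives 25583 visitors per variation and 51166 in total, in line with the numbers quoted above. Halving the minimum difference roughly quadruples the required sample, which is why A/A tests get expensive so quickly.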
If the difference identified in this A/A test is less than 0.01, it confirms the identity of the variations and the accuracy of the A/B testing tool.
Interpreting Results of A/A Test Disproving Identity of Variations
How should results be interpreted if a correctly conducted A/A test does not confirm the identity of the variations?
When analyzing A/A test results, it is important to keep in mind that finding a difference in conversion rates between identical variations is always a possibility.
This is not necessarily evidence that the A/B testing tool is inaccurate, as there is always an element of randomness when it comes to testing (we explained the reasons for randomness in the post on A/B test results analysis).
The significance level of an A/B test is the probability of concluding that the conversion rates of variations A and B differ when in fact they are equal (type I error). E.g. at a significance level of 5%, there is a 1 in 20 chance that a test of truly identical variations reports a difference purely by random chance.
If we repeat the same A/A test many times using an accurately operating A/B testing tool, the proportion of results that confirm the identity of the variations should be roughly as high as the confidence level (at a significance level of 5%, the confidence level equals 95%).
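The effect of repeating an A/A test many times can be illustrated with a small simulation. The sketch below is only an illustration under our own assumptions: a pooled two-proportion z-test stands in for whatever a real testing tool computes, and the true conversion rate, visitor count and number of repetitions are hypothetical values picked to keep the run short.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-tailed p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or only conversions): no evidence of a difference
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(true_rate, visitors, repetitions, alpha=0.05, seed=1):
    """Share of simulated A/A tests that report a significant difference."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(repetitions):
        # Both variations draw visitors with the same true conversion rate.
        conv_a = sum(rng.random() < true_rate for _ in range(visitors))
        conv_b = sum(rng.random() < true_rate for _ in range(visitors))
        if z_test_p_value(conv_a, visitors, conv_b, visitors) < alpha:
            false_positives += 1
    return false_positives / repetitions


share = simulate_aa_tests(true_rate=0.2, visitors=1000, repetitions=1000)
print(f"A/A tests showing a 'significant' difference: {share:.1%}")
```

The printed share should land somewhere near 5%, so around 95% of the simulated A/A tests confirm the identity of the variations even though every single test is run correctly.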
In addition to the statistical randomness mentioned above, there are other reasons why a correctly conducted A/A test may fail to confirm the identity of the variations.
For example, the reason may lie in the heterogeneity of the target audience. Suppose an A/A test is conducted on an audience of all women, while conversion rates differ for women of different age groups.
In this case, if the proportions of the age groups among visitors differ between the two identical variations, and the resulting conversion rates are calculated across all visitors, then a correctly conducted A/A test using an accurately operating A/B testing tool can identify a significant difference between two identical variations.
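A few lines of arithmetic make the audience-mix effect concrete. All numbers in this sketch are hypothetical and chosen only for illustration: two age groups with conversion rates of 30% and 10%, and two identical variations that happen to receive a different mix of these groups.

```python
def blended_rate(segments):
    """Overall conversion rate from (share_of_visitors, conversion_rate) pairs."""
    return sum(share * rate for share, rate in segments)


# Hypothetical per-age-group conversion rates, the same for both variations.
rate_younger, rate_older = 0.30, 0.10

# The variations are identical, but their visitor mix happens to differ.
first = blended_rate([(0.60, rate_younger), (0.40, rate_older)])
second = blended_rate([(0.40, rate_younger), (0.60, rate_older)])
print(f"{first:.2f} vs {second:.2f}")
```

Here the overall rates come out at 0.22 and 0.18, although each age group converts at exactly the same rate on both variations. The gap comes entirely from the difference in audience mix.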
A/A Tests for Checking Accuracy of Bayesian MAB Testing Tools
Using an A/A test to check the accuracy of a Bayesian Multi-Armed Bandit has certain problems, and the details differ from one testing tool to another. A Bayesian MAB testing tool does not require a pre-determined sample size. Instead, it calculates for each variation the probability that it is the optimal one.
If in a Bayesian Multi-Armed Bandit test with two variations it turns out that they perform about the same, either variation can be chosen.
A Bayesian MAB test does not run until the single optimal variation is found (in an A/A test both variations are equally optimal). It runs until it is sure that switching to another variation will not help very much.
Thus, at some iteration in an A/A test, a winner will be declared, but this does not mean that the Bayesian MAB testing tool is unreliable.
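To give a feel for what a "probability to be optimal" looks like, here is a rough Python sketch. It is not the algorithm of SplitMetrics or of any particular tool: it assumes a uniform Beta(1, 1) prior on each conversion rate and estimates the probability by simple Monte Carlo sampling, and the visitor and conversion counts are hypothetical.

```python
import random


def probability_b_is_optimal(conv_a, n_a, conv_b, n_b, draws=20000, seed=1):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Sample each conversion rate from its Beta posterior.
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical A/A data: identical variations with identical observed counts.
p_opt = probability_b_is_optimal(200, 1000, 200, 1000)
print(f"P(B is optimal) = {p_opt:.2f}")
```

With identical observed data the estimate sits close to 0.5. In a live A/A test the observed counts differ by chance, so the probability moves away from 0.5 in one direction or the other even though neither variation is really better.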
Determining Baseline Conversion for Control Variation before A/B Testing
As mentioned above, one should know the baseline conversion rate of the control variation to calculate the sample size for an A/B test. To determine it, some users conduct A/A tests. Let’s consider an example.
Suppose one is running an A/A test where the control gives 1 003 conversions out of 10 000 visitors and the identical variation gives 1 007 conversions out of 10 000 visitors.
The conversion rate for the control is 10.03%, and that for the identical variation is 10.07%. One then uses 10.03% to 10.07% as the conversion rate range for the control variation and conducts an A/B test in the following way.
If the A/B test shows an uplift within this range, one considers the result not significant.
The approach described in this example is not correct. In the post A/B test results analysis, we explained the statistics behind A/B testing and showed the correct way to calculate confidence intervals and draw a conclusion about the significance of A/B test results.
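To see why the 10.03% to 10.07% range is misleading, it helps to compute a confidence interval for the control from the example. The sketch below assumes the simple normal-approximation (Wald) interval, which may not be the exact method from the post referred to above; the function name conversion_rate_interval is ours.

```python
from math import sqrt
from statistics import NormalDist


def conversion_rate_interval(conversions, visitors, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a conversion rate."""
    rate = conversions / visitors
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(rate * (1 - rate) / visitors)
    return rate - margin, rate + margin


# Control from the example: 1 003 conversions out of 10 000 visitors.
low, high = conversion_rate_interval(1003, 10000)
print(f"{low:.2%} to {high:.2%}")
```

Under these assumptions the 95% interval for the control runs from roughly 9.4% to 10.6%, which is far wider than the 0.04 percentage point gap between the two identical variations. The gap between two A/A arms says little about the real uncertainty of the baseline.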
The best way to determine a baseline conversion rate for a control variation is to use a monitoring campaign, which is an experiment that does not have any variations.
Afterthought on A/A test
In some cases, it makes sense to run an A/A test if users are uncertain about a new A/B testing tool and want additional proof that it is operating accurately.
However, it is hardly worth running A/A tests to check the accuracy of an A/B testing tool more often than that. We have shown that a correct A/A test is very resource-consuming: because the minimum difference between the variations’ conversion rates is very small, the calculated sample size is large.
Besides, the likelihood that a testing tool which is neither a new product nor a new version operates inaccurately is very small. Therefore, it is advisable for new users to concentrate on verifying that the A/B testing tool is set up correctly.
For a user planning to conduct an A/A test to check the accuracy of an A/B testing tool, we recommend:
- to pre-determine a sample size;
- to repeat the A/A test several times (with an accurately operating A/B tool, the proportion of results that confirm the identity of the variations should be roughly as high as the confidence level);
- to take into account the peculiarities of the testing setup, e.g. heterogeneity of the target audience.
As for Bayesian MAB testing tools, an A/A test cannot be used to check the tool’s accuracy.
To determine a baseline conversion rate before beginning an A/B test, it’s more effective to use a monitoring campaign than an A/A test.
As for the correct settings for valid A/B tests with SplitMetrics, pay close attention to:
- Testing only one hypothesis at a time. Each alteration to your assets matters and testing multiple changes within one experiment makes identification of a winning element way harder.
- The quality of your hypothesis. It’s useless to run a test when the only difference between variations is the shade of a character’s tie.
- Experiment consistency. It’s better to start with optimizing such elements as screenshots and video previews.
- Choosing the right experiment type. If you rely mainly on organic traffic, opt for category and search A/B tests. If you rely on paid traffic, product page experiments should become your number one priority. Mind that different experiment types call for a different approach to banner design.
- Determining optimal sample size for your experiment.
- Traffic source. Use only verified traffic channels so that you get trustworthy results faster.
- Smart targeting. It’s not only about the right demographic characteristics; you should also pay attention to such details as, say, the corresponding device settings.
- The duration of an A/B test. Run experiments for at least 7-10 days to track users’ behaviour on all weekdays.
- Waiting for the SplitMetrics platform to determine the winner. Normally it takes 150 installs per variation but each case is highly individual.
With SplitMetrics, you can also launch a test with one variation. This feature provides incredible pre-launch opportunities – from product idea validation to nailing the right targeting and identifying the most effective traffic source.
Sure, A/A tests can be helpful at times, but they require a pre-determined sample size, and their usefulness depends on the specifics of the testing tool and on which parameters need to be checked. It’s far more prudent to invest in a top-notch A/B testing tool like the one SplitMetrics provides.