“The more tests, the better!” and other A/B testing myths, debunked

Reading Time: 8 minutes

Will the real A/B testing success metrics please stand up?

It’s 2017, and most marketers understand the importance of A/B testing. The strategy of applying the scientific method to marketing to prove whether an idea will have a positive impact on your bottom-line is no longer novel.

But, while the practice of A/B testing has become more and more common, too many marketers still buy into pervasive A/B testing myths. #AlternativeFacts.

This has been going on for years, but the myths continue to evolve. Other bloggers have already addressed myths like “A/B testing and conversion optimization are the same thing”, and “you should A/B test everything”.

As more A/B testing ‘experts’ pop up, A/B testing myths have become more specific. Driven by best practices and tips and tricks, these myths represent ideas about A/B testing that will derail your marketing optimization efforts if left unaddressed.

Avoid the pitfalls of ad-hoc A/B testing…

Get this guide, and learn how to build an optimization machine at your company. Discover how to use A/B testing as part of your bigger marketing optimization strategy!

By entering your email, you’ll receive bi-weekly WiderFunnel Blog updates and other resources to help you become an optimization champion.



But never fear! With the help of WiderFunnel Optimization Strategist, Dennis Pavlina, I’m going to rebut four A/B testing myths that we hear over and over again. Because there is such a thing as a successful, sustainable A/B testing program…

Into the light, we go!

Myth #1: The more tests, the better!

A lot of marketers equate A/B testing success with A/B testing velocity. And I get it. The more tests you run, the faster you run them, the more likely you are to get a win, and prove the value of A/B testing in general…right?

Not so much. Obsessing over velocity is not going to get you the wins you’re hoping for in the long run.

Mike St Laurent

The key to sustainable A/B testing output, is to find a balance between short-term (maximum testing speed), and long-term (testing for data-collection and insights).

Michael St Laurent, Senior Optimization Strategist, WiderFunnel

When you focus solely on speed, you spend less time structuring your tests, and you will miss out on insights.

With every experiment, you must ensure that it directly addresses the hypothesis. You must track all of the most relevant goals to generate maximum insights, and QA all variations to ensure bugs won’t skew your data.

Dennis Pavlina

An emphasis on velocity can create mistakes that are easily avoided when you spend more time on preparation.

Dennis Pavlina, Optimization Strategist, WiderFunnel

Another problem: If you decide to test many ideas, quickly, you are sacrificing your ability to really validate and leverage an idea. One winning A/B test may mean quick conversion rate lift, but it doesn’t mean you’ve explored the full potential of that idea.

You can often apply the insights gained from one experiment, when building out the strategy for another experiment. Plus, those insights provide additional evidence for testing a particular concept. Lining up a huge list of experiments at once without taking into account these past insights can result in your testing program being more scattershot than evidence-based.

While you can make some noise with an ‘as-many-tests-as-possible’ strategy, you won’t see the big business impact that comes from a properly structured A/B testing strategy.

Myth #2: Statistical significance is the end-all, be-all

A quick overview

Statistical significance tells us whether or not there is a true difference between variation and control (i.e. that the difference is not due to chance). To assess the likelihood, we use a metric called Significance Level (or Confidence Level).

At WiderFunnel, we recommend a 95% significance level whenever possible. If a winning variation is 95% statistically significant, you can be 95% confident that the results are not caused by randomness (a false positive).

Ok, here’s the thing about statistical significance: It is important, but marketers often talk about it as if it is the only determinant for completing an A/B test. In actuality, you cannot view it within a silo.

For example, the testing tool used during a recent experiment reported statistical significance just three hours after the experiment went live. Because statistical significance is viewed as the end-all, be-all, a result like this can be exciting! But, in three hours, we had not gathered a representative sample size.

Claire Vignon Keser

You should not wait for a test to be significant (because it may never happen) or stop a test as soon as it is significant. Instead, you need to wait for the calculated sample size to be reached before stopping a test. Use a test duration calculator to understand better when to stop a test.

After 24 hours, the same experiment and testing tool reported a confidence level of just 88%. That’s how quickly a confidence level can change if you are not taking into account sample size.

Traffic behaves differently over time for all businesses, so you should always run a test for full business cycles, even if you have reached statistical significance. This way, your experiment has taken into account all of the regular fluctuations in traffic that impact your business.

For an e-commerce business, a full business cycle is typically a one-week period; for subscription-based businesses, this might be one month or longer.

Myth #2, Part II: You have to run a test until it reaches statistical significance

As Claire pointed out, this may never happen. And it doesn’t mean you should walk away from an A/B test, completely.

With testing experience, an expert understanding of your testing tool, and by observing the factors I’m about to outline, you can discover actionable insights that are directional (directionally true or false).

You can use these factors to make the decision that makes the most sense for your business: implement the variation based on the observed trends, abandon the variation based on observed trends, and/or create a follow-up test!

  • Results stability: Is the conversion rate difference stable over time, or does it fluctuate? Stability is a positive indicator.
ab testing results stability
Check your graphs! Are conversion rates crossing? Are the lines smooth and flat, or are there spikes and valleys?
  • Experiment timeline: Did I run this experiment for at least a full business cycle? Did conversion rate stability last throughout that cycle?
  • Relativity: If my testing tool uses t-test to determine significance, am I looking at the hard numbers of actual conversions in addition to conversion rate? Does the calculated lift make sense?
  • LIFT & ROI: Is there still potential for the experiment to achieve X% lift? If so, you should let it run as long as it is viable, especially when considering the ROI.
  • Impact on other elements: If elements outside the experiment are unstable (social shares, average order value, etc.) the observed conversion rate may also be unstable.

Myth #3: An A/B test is only as good as its effect on conversion rates

Well, if conversion rate is the only success metric you are tracking, this may be true. But you’re underestimating the true growth potential of A/B testing if that’s how you structure your tests!

To clarify: Your main success metric should always be linked to your biggest revenue driver.

But, that doesn’t mean you shouldn’t track other relevant metrics! At WiderFunnel, we set up as many relevant secondary goals (clicks, visits, field completions, etc.) as possible for each experiment.

Dennis Pavlina

This ensures that we aren’t just gaining insights about the impact a variation has on conversion rate, but also the impact it’s having on visitor behavior.

– Dennis Pavlina

When you observe secondary goal metrics, your A/B testing becomes exponentially more valuable because every experiment generates a wide range of secondary insights. These can be used to create follow up experiments, identify pain points, and create a better understanding of how visitors move through your site.

An example

One of our clients provides an online consumer information service — users type in a question and get an Expert answer. This client has a 4-step funnel. With every test we run, we aim to increase transactions: the final, and most important conversion.

But, we also track secondary goals, like click-through-rates, and refunds/chargebacks, so that we can observe how a variation influences visitor behavior.

In one experiment, we made a change to step one of the funnel (the landing page). Our goal was to set clearer visitor expectations at the beginning of the purchasing experience. We tested 3 variations against the original, and all 3 won resulted in increased transactions (hooray!).

The secondary goals revealed important insights about visitor behavior, though! Firstly, each variation resulted in substantial drop-offs from step 1 to step 2…fewer people were entering the funnel. But, from there, we saw gradual increases in clicks to steps 3 and 4.

Our variations seemed to be filtering out visitors without strong purchasing intent. We also saw an interesting pattern with one of our variations: It increased clicks from step 3 to step 4 by almost 12% (a huge increase), but decreased actual conversions by -1.6%. This result was evidence that the call-to-action on step 4 was extremely weak (which led to a follow-up test!)

ab testing funnel analysis
You can see how each variation fared against the Control in this funnel analysis.

We also saw large decreases in refunds and chargebacks for this client, which further supported the idea that the right visitors (i.e. the wrong visitors) were the ones who were dropping off.

This is just a taste of what every A/B test could be worth to your business. The right goal tracking can unlock piles of insights about your target visitors.

Myth #4: A/B testing takes little to no thought or planning

Believe it or not, marketers still think this way. They still view A/B testing on a small scale, in simple terms.

But A/B testing is part of a greater whole—it’s one piece of your marketing optimization program—and you must build your tests accordingly. A one-off, ad-hoc test may yield short-term results, but the power of A/B testing lies in iteration, and in planning.

ab testing infinity optimization process
A/B testing is just a part of the marketing optimization machine.

At WiderFunnel, a significant amount of research goes into developing ideas for a single A/B test. Even tests that may seem intuitive, or common-sensical, are the result of research.

ab testing planning
The WiderFunnel strategy team gathers to share and discuss A/B testing insights.

Because, with any test, you want to make sure that you are addressing areas within your digital experiences that are the most in need of improvement. And you should always have evidence to support your use of resources when you decide to test an idea. Any idea.

So, what does a revenue-driving A/B testing program actually look like?

Today, tools and technology allow you to track almost any marketing metric. Meaning, you have an endless sea of evidence that you can use to generate ideas on how to improve your digital experiences.

Which makes A/B testing more important than ever.

An A/B test shows you, objectively, whether or not one of your many ideas will actually increase conversion rates and revenue. And, it shows you when an idea doesn’t align with your user expectations and will hurt your conversion rates.

And marketers recognize the value of A/B testing. We are firmly in the era of the data-driven CMO: Marketing ideas must be proven, and backed by sound data.

But results-driving A/B testing happens when you acknowledge that it is just one piece of a much larger puzzle.

One of our favorite A/B testing success stories is that of DMV.org, a non-government content website. If you want to see what a truly successful A/B testing strategy looks like, check out this case study. Here are the high level details:

We’ve been testing with DMV.org for almost four years. In fact, we just launched our 100th test with them. For DMV.org, A/B testing is a step within their optimization program.

Continuous user research and data gathering informs hypotheses that are prioritized and created into A/B tests (that are structured using proper Design of Experiments). Each A/B test delivers business growth and/or insights, and these insights are fed back into the data gathering. It’s a cycle of continuous improvement.

And here’s the kicker: Since DMV.org began A/B testing strategically, they have doubled their revenue year over year, and have seen an over 280% conversion rate increase. Those numbers kinda speak for themselves, huh?

What do you think?

Do you agree with the myths above? What are some misconceptions around A/B testing that you would like to see debunked? Let us know in the comments!

Enjoy this post? Share with your friends and colleagues:

Author

Contributor

  • stucchio

    I just wanted to leave a quick correction to this article. Your definition of “statistical significance” is actually completely wrong.

    In fact, if your test results are statistically significant at the 5% level, this means that there is a 5% chance that a hypothetical future A/A test (i.e. an A/B test, but where B is identical to the control) returns a result at least as strong as what you just saw.

    Concretely – if I see a 7% uplift in a test with a statistical significance of 4% and 10,000 samples, this means that if I ran a new A/A test with 10,000 samples, there’s a 4% chance that I’ll see at least 7% uplift even though both variation and control are exactly the same.

    Only Bayesian tools (such as VWO or Google Optimize) can actually provide you with the chance that you made a mistake during the test.

    Disclaimer: I’m the director of data science for VWO.

    • Hey Chris,
      Thanks for the feedback! I actually remember you from a post that Alhan Keser wrote on the t-test. You’re right – I was trying to communicate the concept in laymen’s terms, and may have oversimplified the explanation. We have updated the section on statistical significance to read clearer.