Beyond A vs. B: How to get better results with better experiment design

You’ve been pushing to do more testing at your organization.

You’ve heard that your competitors at ______ are A/B testing, and that their customer experience is (dare I say it?) better than yours.

You believe in marketing backed by science and data, and you have worked to get the executive team at your company on board with a tested strategy.

You’re excited to begin! To learn more about your customers and grow your business.

You run one A/B test. And then another. And then another. But you aren’t seeing that conversion rate lift you promised. You start to hear murmurs and doubts. You start to panic a little.

You could start testing as fast as you can, trying to get that first win. (But you shouldn’t.)

Instead, you need to reexamine how you are structuring your tests. Because, as Alhan Keser writes,

If your results are disappointing, it may not only be what you are testing – it is definitely how you are testing. While there are several factors for success, one of the most important to consider is Design of Experiments (DOE).

This isn’t the first (or even the second) time we have written about Design of Experiments on the WiderFunnel blog. Because that’s how important it is. Seriously.

For this post, I teamed up with Director of Optimization Strategy, Nick So, and Optimization Strategist, Michael St Laurent, to take a deeper look at the best ways to structure your experiments for maximum growth and insights.


Warning: Things will get a teensy bit technical, but this is a vital part of any high-performing marketing optimization program.

The basics: Defining A/B, MVT, and factorial

Marketers often use the term ‘A/B testing’ to refer to marketing experimentation in general. But there are multiple different ways to structure your experiments. A/B testing is just one of them.

Let’s look at a few: A/B testing, A/B/n testing, multivariate (MVT), and factorial design.

A/B test

In an A/B test, you are testing your original page / experience (A) against a single variation (B) to see which will result in a higher conversion rate. Variation B might feature multiple changes (i.e. a ‘cluster’ of changes), or a single isolated change.

When you change multiple elements in a single variation, you might see lift, but what about insights?

In an A/B/n test, you are testing more than two variations of a page at once. “N” refers to the number of versions being tested, anywhere from two versions to the “nth” version.

Multivariate test (MVT)

With multivariate testing, you test each individual change in isolation, one against another, by mixing and matching every possible combination of changes.

Imagine you want to test a homepage re-design with four changes in a single variation:

  • Change A: New hero banner
  • Change B: New call-to-action (CTA) copy
  • Change C: New CTA color
  • Change D: New value proposition statement

Hypothetically, let’s assume that each change has the following impact on your conversion rate:

  • Change A = +10%
  • Change B = +5%
  • Change C = -25%
  • Change D = +5%

If you were to run a classic A/B test, pitting your current control page (A) against a combination of all four changes at once (B), you would get a hypothetical decrease of 5% overall (10% + 5% – 25% + 5% = –5%). You would assume that your re-design did not work and most likely discard the ideas.

With a multivariate test, however, every possible combination of the four changes would be a variation: 15 in total, from change A alone through A + B + C + D.

Multivariate testing is great because it shows you the positive or negative impact of every single change, and of every combination of changes, revealing the ideal combination (in this theoretical example: A + B + D).
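To make the combinatorics concrete, here is a minimal Python sketch that lists every variation a full multivariate test of the four hypothetical changes would need. It assumes, purely for illustration, that the lifts simply add up with no interaction effects, which is the same simplification used in the example above. The names CHANGE_LIFTS, all_variations, and combined_lift are made up for this sketch.

```python
from itertools import combinations

# Hypothetical per-change lifts from the example above, in percentage points.
CHANGE_LIFTS = {"A": 10, "B": 5, "C": -25, "D": 5}


def all_variations(changes):
    """Return every non-empty combination of changes (one MVT variation each)."""
    names = sorted(changes)
    return [
        combo
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
    ]


def combined_lift(combo, changes):
    """Simplifying assumption: lifts add up, with no interaction effects."""
    return sum(changes[name] for name in combo)


variations = all_variations(CHANGE_LIFTS)
best = max(variations, key=lambda combo: combined_lift(combo, CHANGE_LIFTS))

print(len(variations))                                       # 15
print(" + ".join(best), combined_lift(best, CHANGE_LIFTS))   # A + B + D 20
```

Real changes often interact with one another, so treat the additive sum as a teaching device rather than a forecast.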

However, this strategy is rarely practical in the real world. Even if you have a ton of traffic, a test with 15 variations would still take more time than most marketers have to reach any kind of statistical significance.

The more variations you test, the more your traffic will be split while testing, and the longer it will take for your tests to reach statistical significance. Many companies simply can’t follow the principles of MVT because they don’t have enough traffic.
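A rough way to see the traffic problem is to estimate how long a test would need to run as the number of variations grows. The sketch below is not a method from this post: it uses a textbook normal-approximation sample size formula for comparing two conversion rates, and the baseline rate, target lift, and daily traffic are hypothetical inputs. It also ignores any correction for comparing many variations at once, which would make the picture worse.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed in each arm to detect a relative lift
    with a two-sided two-proportion z-test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def days_to_finish(num_variations, daily_visitors, baseline, relative_lift):
    """Traffic is split evenly across the control plus every variation."""
    arms = num_variations + 1
    needed = visitors_per_arm(baseline, relative_lift) * arms
    return ceil(needed / daily_visitors)


# Hypothetical inputs: 10% baseline, 10% relative lift, 5,000 visitors a day.
for n in (1, 4, 15):
    print(n, "variation(s):", days_to_finish(n, 5000, 0.10, 0.10), "days")
```

Because the required sample per arm stays the same while the number of arms grows, the estimated duration scales roughly with the number of variations.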

Enter factorial experiment design. Factorial design allows for the speed of pure A/B testing combined with the insights of multivariate testing.

Factorial design: The middle ground

Factorial design is another method of Design of Experiments. Similar to MVT, factorial design allows you to test more than one element change within the same variation.

The greatest difference is that factorial design doesn’t force you to test every possible combination of changes.

Rather than creating a variation for every combination of changed elements (as you would with MVT), you can design your experiment to focus on specific isolations that you hypothesize will have the biggest impact.

With basic factorial experiment design, you could set up the following variations in our hypothetical example:

VarA: Change A = +10%
VarB: Change A + B = +15%
VarC: Change A + B + C = -10%
VarD: Change A + B + C + D = -5%

In this basic example, variation A features a single change; VarB is built on VarA, and VarC is built on VarB.
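Working backwards through nested variations can be sketched in a few lines of Python. This is a simplified, additive view that just subtracts each variation's lift from the lift of the variation it was built on, using the hypothetical numbers above. The names NESTED_LIFTS and isolate_changes are illustrative only.

```python
# Lift of each nested variation versus the control, in percentage points,
# taken from the hypothetical example above. Each entry is built on the
# one before it (VarA on the control, VarB on VarA, and so on).
NESTED_LIFTS = [("A", 10), ("B", 15), ("C", -10), ("D", -5)]


def isolate_changes(nested_lifts):
    """Attribute to each change the difference between its variation and
    the variation it was built on (simplified, additive view)."""
    isolated = {}
    previous = 0  # the control is the starting baseline
    for change, lift in nested_lifts:
        isolated[change] = lift - previous
        previous = lift
    return isolated


print(isolate_changes(NESTED_LIFTS))
# {'A': 10, 'B': 5, 'C': -25, 'D': 5}
```

Four variations are enough to recover the hypothetical value of all four changes, as long as you accept the additive simplification.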

NOTE: With factorial design, estimating the value (e.g. conversion rate lift) of each change is a bit more complex than shown above. I’ll explain.

Firstly, let’s imagine that our control page has a baseline conversion rate of 10% and that each variation receives 1,000 unique visitors during your test.

When you estimate the value of change A, you are using your control as a baseline.

Variation A versus the control.

Given the above information, you would estimate that change A is worth a 10% lift by comparing the 11% conversion rate of variation A against the 10% conversion rate of your control.

The estimated conversion rate lift of change A = (11 / 10 – 1) = 10%

But, when estimating the value of change B, variation A must become your new baseline.

Variation B versus variation A.

The estimated conversion rate lift of change B = (11.5 / 11 – 1) = 4.5%

As you can see, the ‘value’ of change B is slightly different from the 5% difference shown above.
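The shifting-baseline calculation can be written as a short Python sketch. The conversion rates come from the example above; the dictionary layout and the relative_lift helper are just one illustrative way to organize it.

```python
def relative_lift(variation_rate, baseline_rate):
    """Relative conversion rate lift of a variation over its baseline."""
    return variation_rate / baseline_rate - 1


# Conversion rates (%) from the example: control, variation A, variation B.
rates = {"control": 10.0, "VarA": 11.0, "VarB": 11.5}

# Each variation is compared with the one it was built on.
built_on = {"VarA": "control", "VarB": "VarA"}

for variation, baseline in built_on.items():
    lift = relative_lift(rates[variation], rates[baseline])
    print(f"{variation} vs {baseline}: {lift:.1%}")
# VarA vs control: 10.0%
# VarB vs VarA: 4.5%
```

The key design choice is the built_on mapping: the baseline for each variation is whatever it was built on, not necessarily the control.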

 
When you structure your tests with factorial design, you can work backwards to isolate the effect of each individual change by comparing variations. But, in this scenario, you have four variations instead of 15.

We are essentially nesting A/B tests into larger experiments so that we can still get results quickly without sacrificing insights gained by isolations.

– Michael St Laurent, Optimization Strategist, WiderFunnel

Then, you would simply re-validate the hypothesized positive results (Change A + B + D) in a standard A/B test against the original control to see if the numbers align with your prediction.

Factorial allows you to get the best potential lift, with five total variations in two tests, rather than 15 variations in a single multivariate test.

But, wait…

It’s not always that simple. How do you hypothesize which elements will have the biggest impact? How do you choose which changes to combine and which to isolate?

The Strategist’s Exploration

The answer lies in the Explore (or research gathering) phase of your testing process.

At WiderFunnel, Explore is an expansive thinking zone, where all options are considered. Ideas are informed by your business context, persuasion principles, digital analytics, user research, and your past test insights and archive.

Experience is the other side to this coin. A seasoned optimization strategist can look at the proposed changes and determine which changes to combine (i.e. cluster), and which changes should be isolated due to risk or potential insights to be gained.

At WiderFunnel, we don’t just invest in the rigorous training of our Strategists. We also have a 10-year-deep test archive that our Strategy team continuously draws upon when determining which changes to cluster, and which to isolate.

Factorial design in action: A case study

Once upon a time, we were testing with Annie Selke, a retailer of luxury home-ware goods. This story follows two experiments we ran on Annie Selke’s product category page.

(You may have already read about what we did during this test, but now I’m going to get into the details of how we did it. It’s a beautiful illustration of factorial design in action!)

Experiment 4.7

In the first experiment, we tested three variations against the control. As the experiment number suggests, this was not the first test we ran with Annie Selke, but it is the ‘first’ test in this story.

Experiment 4.7 control product category page.

Variation A featured an isolated change to the ‘Sort By’ filters below the image, turning them into a drop-down menu.

Replaced original ‘Sort By’ categories with a more traditional drop-down menu.

Evidence?

This change was informed by qualitative click map data, which showed low interaction with the original filters. Strategists also theorized that, without context, visitors may not even know that these boxes are filters (based on e-commerce best practices). This variation was built on the control.

Variation B was also built on the control, and featured another isolated change to reduce the left navigation.

Reduced left-hand navigation.

Evidence?

Click map data showed that most visitors were clicking on “Size” and “Palette”, and past testing had revealed that Annie Selke visitors were sensitive to removing distractions. Plus, the persuasion principle known as the Paradox of Choice holds that more choice = more anxiety for visitors.

Unlike variation B, variation C was built on variation A, and featured a final isolated change: a collapsed left navigation.

Collapsed left-hand filter (built on VarA).

Evidence?

This variation was informed by the same evidence as variation B.

Results

Variation A (built on the control) saw a 23.2% decrease in transactions.
Variation B (built on the control) saw no change.
Variation C (built on variation A) saw a 1.9% decrease in transactions.

But wait! Because variation C was built on variation A, we knew that the estimated value of change C (the collapsed filter) was 19.1%.

The next step was to validate our estimated lift of 19.1% in a follow up experiment.

Experiment 4.8

The follow-up test also featured three variations versus the original control. Because you should never waste the opportunity to gather more insights!

Variation A was our validation variation. It featured the collapsed filter (change C) from 4.7’s variation C, but maintained the original ‘Sort By’ functionality from 4.7’s control.

Collapsed filter & original ‘Sort By’ functionality.

Variation B was built on variation A, and featured two changes emphasizing visitor fascination with colors. We 1) changed the left nav filter from “palette” to “color”, and 2) added color imagery within the left nav filter.

Updated “palette” to “color”, and added color imagery. (A variation featuring two clustered changes).

Evidence?

Click map data suggested that Annie Selke visitors are most interested in refining their results by color, and past test results also showed visitor sensitivity to color.

Variation C was built on variation A, and featured a single isolated change: we made the collapsed left nav persistent as the visitor scrolled.

Made the collapsed filter persistent.

Evidence?

Scroll maps and click maps suggested that visitors want to scroll down the page, and view many products.

Results

Variation A led to a 15.6% increase in transactions, which is pretty close to our estimated 19% lift, validating the value of the collapsed left navigation!

Variation B was the big winner, leading to a 23.6% increase in transactions. Based on this win, we could estimate the value of the emphasis on color.

Variation C resulted in a 9.8% increase in transactions, but because it was built on variation A (not on the control), we learned that the persistent left navigation was actually responsible for an 11.2% decrease in transactions.

This is what factorial design looks like in action: big wins, and big insights, informed by human intelligence.

The best testing framework for you

What are your testing goals?

If you are in a situation where potential revenue gains outweigh the potential insights to be gained or your test has little long-term value, you may want to go with a standard A/B cluster test.

If you have lots and lots of traffic, and value insights above everything, multivariate may be for you.

If you want the growth-driving power of pure A/B testing, as well as insightful takeaways about your customers, you should explore factorial design.

A note of encouragement: With factorial design, your tests will get better as you continue to test. With every test, you will learn more about how your customers behave, and what they want. Which will make every subsequent hypothesis smarter, and every test more impactful.

One 10% win without insights may turn heads your direction now, but a test that delivers insights can turn into five 10% wins down the line. It’s similar to the compounding effect: collecting insights now can mean massive payouts over time.

– Michael St Laurent

Contributors

Nick So

Director of Optimization Strategy

Nick brings experiments to life. He leads research and strategy to come up with tests and hypotheses to help clients gain business insights and revenue lift.

Michael St Laurent

Optimization Strategist

Michael ensures WiderFunnel delivers the most accurate experiments in the shortest time possible. He strives to deliver A-ha moments for clients every day, and isn't satisfied until there are no more questions left to be answered.