Goodbye, t-test: new stats models for A/B testing boost accuracy, effectiveness


The t-test has served as a workhorse for conversion optimization teams for many years — though there’s always been confusion about what the results really mean.

The statistical method that has driven A/B testing analyses has served us well, but it’s clear the t-test is now outdated. With websites receiving and harvesting data constantly, the t-test is simply too slow and can’t give teams the up-to-date results they need to make fast business decisions.

Plus, typical t-test result reports have often been misleading, as you can see in the example below.
(Note: Chris Goward shows how to interpret t-test results here.)

This is what results used to look like with the classic t-test. A few minutes into an A/B test and we’ve already got a “winner”… right. Optimizely took the first step earlier this year in revolutionizing how A/B test results are calculated.

Fortunately, there are now solutions. Optimizely and VWO have both developed new models for analyzing A/B test and conversion results that can help businesses keep up with the flood of data they get from their websites. Here’s what you need to know.

The background

Back in 2007, Google launched its free Website Optimizer tool, which made A/B testing for websites much more accessible. It used the t-test, which looks at differences between means relative to spread or variability, to determine which variation was the winner or loser.
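For context, here is a minimal sketch of the kind of two-sample t-test those early tools ran on conversion data. The visitor counts and conversion rates are invented purely for illustration.

```python
# Minimal sketch of a classic two-sample t-test on conversion data.
# The visitor counts and conversion rates below are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# 1 = converted, 0 = did not convert, for control (A) and variation (B)
control   = rng.binomial(1, 0.10, size=5000)   # ~10% baseline conversion rate
variation = rng.binomial(1, 0.11, size=5000)   # ~11% conversion rate

# Welch's t-test compares the difference in means relative to the spread
t_stat, p_value = stats.ttest_ind(variation, control, equal_var=False)
print(f"lift: {variation.mean() - control.mean():.4f}, p-value: {p_value:.4f}")
```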
The launch inspired other optimization platforms. Optimizely and VWO emerged as the top two in a digital arms race. Features such as WYSIWYG editors made it easy to create website variations, while marketing software integration and mobile capabilities made the tools even more useful.

Why the change?

There are several reasons the t-test became outdated.

Leonid Pekelis
  • It simply isn’t made for the job. “The original test was meant to be run in the field with an end-point to collecting data,” says Optimizely’s in-house statistician, Leo Pekelis. “You set up your hypothesis, you gather data, you see if your data has evidence, then you report your results. So it’s a very compartmentalized and linear fashion of how to deal with test results.” The error rates calculated are for that procedure, not for the way people collect A/B data now.
  • The t-test, as used traditionally in A/B testing, is plagued by false discoveries, especially when results are checked continuously — which is, of course, what conversion optimization teams do. The problem is that an ongoing t-test can appear conclusive well before it has collected enough data to be trusted (see the simulation sketch after this list).
  • The number of tests run per month can reach into the hundreds on high-output conversion optimization teams. The t-test has the potential to produce too many false positives, even with the right sample size of visitors. Meanwhile, major business decisions are waiting on the results.
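To make the “peeking” problem concrete, here is a rough simulation under invented assumptions: both variations share the same true conversion rate, yet checking a t-test after every batch of visitors declares a false winner far more often than the nominal 5 percent error rate.

```python
# Rough simulation of the "peeking" problem: both variations have the SAME
# true conversion rate, yet checking a t-test after every batch of visitors
# "finds" a winner far more often than the nominal 5% error rate.
# All numbers here are illustrative assumptions, not anyone's production data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, batches, batch_size, alpha = 1000, 20, 500, 0.05
false_winners = 0

for _ in range(n_experiments):
    a = rng.binomial(1, 0.10, size=batches * batch_size)
    b = rng.binomial(1, 0.10, size=batches * batch_size)   # no real difference
    for k in range(1, batches + 1):
        n = k * batch_size
        _, p = stats.ttest_ind(a[:n], b[:n], equal_var=False)
        if p < alpha:                      # analyst peeks and calls a winner
            false_winners += 1
            break

print(f"false discovery rate with continuous peeking: {false_winners / n_experiments:.2%}")
```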

What are the new A/B testing options?

New solutions from Optimizely and VWO have a similar goal: Put the t-test out to pasture, in favor of more useful and immediate statistical models that favor the demands of digital A/B testing.

Optimizely

Optimizely’s Stats Engine uses sequential testing. It’s better suited to A/B testing because it evaluates results as data comes in, instead of “fixed-horizon testing” against a predetermined sample size. The approach brings in a Bayesian view via a likelihood ratio, a model of how the variation is expected to perform.
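Optimizely hasn’t published the full Stats Engine math here, so the snippet below is only a generic sketch of sequential testing (Wald’s sequential probability ratio test on a single conversion stream) to show the “evaluate as data comes in” idea. The conversion rates, error targets, and stopping thresholds are assumptions for illustration, not Stats Engine’s actual parameters.

```python
# Generic sketch of sequential testing on a single stream of conversions:
# Wald's sequential probability ratio test (SPRT) for H0: p = 0.10 vs
# H1: p = 0.12. This illustrates "evaluate as data comes in", but it is NOT
# Optimizely's actual Stats Engine algorithm; all parameters are assumptions.
import math
import numpy as np

p0, p1 = 0.10, 0.12            # null and alternative conversion rates (assumed)
alpha, beta = 0.05, 0.20       # target type I / type II error rates
upper = math.log((1 - beta) / alpha)    # conclude in favour of H1 above this
lower = math.log(beta / (1 - alpha))    # conclude in favour of H0 below this

rng = np.random.default_rng(7)
llr = 0.0
for n, converted in enumerate(rng.binomial(1, 0.12, size=100_000), start=1):
    # add this visitor's log-likelihood ratio contribution
    llr += math.log(p1 / p0) if converted else math.log((1 - p1) / (1 - p0))
    if llr >= upper:
        print(f"stopped after {n} visitors: evidence favours the higher rate")
        break
    if llr <= lower:
        print(f"stopped after {n} visitors: evidence favours the baseline rate")
        break
```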

Optimizely’s Stats Engine.

With Optimizely, you can look at your results at any time, and have more confidence in their reliability. You’re far less likely to see a variation falsely declared a winner. Optimizely says this is possible without sacrificing speed in most cases, especially when the actual lift in a variation is higher than what you would have set as your minimum detectable effect.
“When it comes to speed vs. accuracy in statistics, there’s no such thing as a free lunch,” says Pekelis. “We built Stats Engine to be as quick as possible while maintaining the integrity of our results.”

Visual Website Optimizer

Chris Stucchio

VWO’s solution, SmartStats, is a little newer. It has been in beta since August 2015 and just launched this month. This option goes fully Bayesian: it reports a range, rather than a specific conversion rate, and that range tightens as more data comes in. Its main goal is to identify the best performer as fast as possible, rather than to pin down an exact winning margin. “What the client wants is an exact number,” says Chris Stucchio, VWO’s in-house statistician. “But I’m a statistician, so I’m just going to be honest and say you cannot have an exact number — there isn’t one. All you can get is a range.” If you wait for an exact number, you sacrifice conversions.
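To make the “range that tightens” idea concrete, here is a minimal sketch using a standard Beta posterior credible interval for a conversion rate. The prior and the visitor and conversion counts are invented, and this is not VWO’s actual SmartStats model.

```python
# Generic illustration of the "range that tightens over time" idea: a Beta
# posterior credible interval for a variation's conversion rate. This is not
# VWO's actual SmartStats model; the prior and data below are assumptions.
from scipy import stats

prior_a, prior_b = 1, 1          # uniform Beta(1, 1) prior (assumed)
for visitors, conversions in [(100, 9), (1_000, 105), (10_000, 1_020)]:
    posterior = stats.beta(prior_a + conversions, prior_b + visitors - conversions)
    low, high = posterior.ppf([0.025, 0.975])
    print(f"{visitors:>6} visitors: 95% credible interval {low:.3f} - {high:.3f}")
```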

In addition, SmartStats shows how much you could stand to lose if the variation performs at its minimum threshold conversion rate. It also has built-in alerts to reduce methodological errors: the tool notifies you, for example, if you change your audience targeting mid-experiment or fail to run your experiment for a whole number of weeks.
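The “how much you could stand to lose” figure is typically an expected-loss calculation from the posterior distributions. The sketch below is a generic Monte Carlo version with made-up visitor and conversion counts, not VWO’s exact formula.

```python
# Generic Monte Carlo sketch of "potential loss": if we deploy B, how much
# conversion rate do we expect to give up in the cases where A is actually
# better? Priors and counts are invented; this is not VWO's exact calculation.
import numpy as np

rng = np.random.default_rng(1)
samples = 100_000
# Posterior draws for each variation's conversion rate (Beta(1, 1) prior assumed)
a = rng.beta(1 + 110, 1 + 1_000 - 110, size=samples)   # A: 110/1000 conversions
b = rng.beta(1 + 125, 1 + 1_000 - 125, size=samples)   # B: 125/1000 conversions

prob_b_best = (b > a).mean()
loss_if_b   = np.maximum(a - b, 0).mean()   # expected loss of choosing B

print(f"P(B is best): {prob_b_best:.2%}, expected loss of deploying B: {loss_if_b:.4f}")
```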

Campaign warnings and campaign pause

What’s the difference between SmartStats and Stats Engine?

I was still curious: what would results look like if you ran the exact same experiment in both VWO and Optimizely at the same time? Which would give you a report faster and which would be more precise about the result?

According to Stucchio, VWO would likely give you the first result that says B is not worse than A and is probably better. “Some time after this, Stats Engine will say, ‘Yeah, you should probably deploy B.’ In the meantime, if you didn’t stop the test in VWO and you looked in our tool, you would discover that our credible intervals instead of saying 0-10 percent, they might say 4-7 percent.”

On the other hand, Pekelis points to the risk of average error control. “A fully Bayesian method will tend to make statements like ‘B is not worse than A’ sooner than Stats Engine, but this is problematic in one very important case: when B is actually worse than A,” he says.

Thomas Bayes

“The rate of making a wrong call is controlled for Bayesian methods on average, meaning that for some experiments it will be higher and some it will be lower. The problem with this is customers who have more of the higher error experiments will see more mistakes than anticipated. Making a call as early as possible exacerbates this, and that’s the cost – some customers will be exposed to potentially a lot more errors. Are there cases where average error control is a good idea? Sure. But our philosophy is to explore solutions starting from a position of rigorous accuracy and clearly communicate any tradeoffs.”

Pekelis suggests that the best way to speed up a test in Stats Engine is to lower the significance threshold, and accept a higher error rate. “We felt a known cost was better than an unknown one,” he says.

So what it comes down to is that with either tool, you’re going to be able to get more reliable results faster than before, but you’ll need to decide how much risk you’re willing to take in order to expedite your results. As always, you need to be scientific when it comes to your methodology. These tools will not make up for poor experiment design.

For those of you who’ve been calculating required sample sizes beforehand, your tests will now have the chance to end faster: if your variation’s actual lift turns out to be larger than the minimum detectable effect you would have plugged into a classic t-test sample size calculator, you’ll be able to move on to your next test a lot sooner.
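For reference, this is the kind of classic fixed-horizon calculation such a sample size calculator performs. The baseline rate, minimum detectable effect, significance level, and power below are assumptions you would replace with your own.

```python
# Classic fixed-horizon sample size calculation (per variation) for comparing
# two conversion rates, i.e. the kind of pre-test calculator mentioned above.
# Baseline rate, minimum detectable effect, alpha and power are assumptions.
import math
from scipy.stats import norm

baseline = 0.10          # baseline conversion rate
mde      = 0.02          # minimum detectable effect (absolute lift)
alpha    = 0.05          # two-sided significance level
power    = 0.80

p1, p2 = baseline, baseline + mde
z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
n = (z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
print(f"about {math.ceil(n)} visitors needed per variation")
```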

It’s also worth mentioning how much easier this evolution makes communicating results to stakeholders or clients. Before, results could look definitive prematurely, leaving stakeholders asking why action wasn’t being taken; now, results only look definitive when they actually are. Imagine all of the painful conversations you can now avoid!

Here’s a chart comparing the slight differences in how the two companies approach the solution.

Stats model comparison table

In the end, both are substantial improvements over the old t-test and will help optimization champions do better work.

Update on Oct 27: a previous version of the above comparison chart suggested that Stats Engine only provides an “exact number lift” when there is a range lift provided as well. The chart has been updated to reflect that fact.


About WiderFunnel
WiderFunnel creates profitable ‘A-ha!’ moments for clients. Our team of optimization experts works together with a singular focus: conversion optimization of our client’s customer touchpoints through insightful A/B testing. We don’t just consult and give advice — we test every recommendation to prove its value and gain tested insights. Contact us to learn more.



Alhan Keser

Optimization Strategist

Alhan was the head of WiderFunnel Labs and former Director of Optimization Strategy. He is currently Director of Digital Strategy at Blue Fountain Media.

