Sunday, July 27, 2014

A/B testing is like sex at high school

A few days ago I went on record saying that A/B testing is like sex at high school: everyone talks about it, but not very many do it in earnest. I want to follow up on the topic with some additional thoughts (don't worry, I won't stretch the high school analogy any further).

When talking to people about A/B testing I've noticed that there are four (stereo)types of mindset that prevent companies from successfully using split tests as a tool to improve their conversion funnel.

1) Procrastinating

The favorite answer from people in this camp to suggestions for website or product improvements is "we'll have to A/B test that" – as in "we should A/B test that, some time, when we've added A/B testing capability". It is often used as an excuse for brushing off ideas for improvement, and the fallacy is that just because an A/B test is the best way to validate an assumption doesn't mean that all assumptions are equally good or equally likely to be true.

Yes, A/B tests are the best way to test product improvements. But if you're not ready for A/B testing yet, that shouldn't stop you from improving your product based on your opinions and instincts.

2) Naive 

People in this group draw conclusions from data which isn't conclusive. I've seen this several times: results are not statistically significant, A and B didn't get the same type of traffic, A and B were tested sequentially rather than simultaneously, only a small part of the conversion funnel was taken into account. These and all kinds of other methodological errors can lead to erroneous conclusions.
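
To make the "statistically significant" part concrete, here is a minimal sketch (in Python, with made-up numbers) of a two-proportion z-test on conversion counts – one common way to sanity-check an A/B result before acting on it:

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates.
    Returns the z statistic and an approximate p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)               # pooled conversion rate
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # normal approximation
    return z, p_value

# Made-up numbers: 10,000 visitors per variation, 3.00% vs. 3.45% conversion
z, p = two_proportion_z_test(conv_a=300, n_a=10_000, conv_b=345, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.3f}")  # p is roughly 0.07 here, i.e. not significant at the usual 5% level
```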

Making decisions based on gut feelings as opposed to data isn't great, but in this case at least you know what you don't know. Making decisions based on wrong data – thinking that you understand something which you actually don't – is much worse.

3) Opinionated

There's a school of thought among designers that says A/B testing only helps you find local maxima. While I completely agree with my friend Nikos Moraitakis that iterative improvement is no substitute for creativity, I don't see a reason why A/B testing can't be used to test radically different designs, too.

Designers have to be opinionated. Chances are that out of the thousands of ideas you'd like to test, you can only test a handful, because the number of tests you can run to statistical significance is limited by your visitor and signup volume. You need talented designers with strong convictions to tell you which five ideas out of those thousands are worth a shot. But then, do A/B test those five ideas.

4) Disillusioned

The more you learn about topics like A/B testing and marketing attribution analysis, the more you realize how complicated things are and how hard it is to get conclusive, actionable data. 

If you want to test different signup pages for a SaaS product, for example, it's not enough to look at the visitor-to-signup conversion rate. What matters is the conversion rate of the entire funnel, from visitors all the way through to paying customers. It's quite possible that the signup page which performs best in terms of visitor-to-signup rate (maybe one which asks the user for minimal data input) leads to a lower signup-to-paying conversion rate (because signups are less pre-qualified), and that another version of your signup page has a better overall visitor-to-paying conversion rate. To take that even further, it doesn't stop at the signup-to-paying step: you'll also want to track the churn rate of the "A" cohort vs. the "B" cohort over time.
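
As a toy illustration of that point – with hypothetical funnel counts, not real data – here is how the variation that wins on visitor-to-signup can still lose on visitor-to-paying:

```python
# Hypothetical funnel counts for two signup-page variations (not real data)
funnel = {
    "A": {"visitors": 20_000, "signups": 900,   "paying": 90},  # longer form, better-qualified signups
    "B": {"visitors": 20_000, "signups": 1_400, "paying": 84},  # shorter form, more but weaker signups
}

for name, f in funnel.items():
    print(f"{name}: visitor-to-signup {f['signups'] / f['visitors']:.1%}, "
          f"signup-to-paying {f['paying'] / f['signups']:.1%}, "
          f"visitor-to-paying {f['paying'] / f['visitors']:.1%}")

# B wins on visitor-to-signup (7.0% vs. 4.5%) but loses on visitor-to-paying (0.42% vs. 0.45%) --
# and that's before looking at how the two cohorts churn over time.
```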

If you think about complexities like this, it's easy to give up and conclude that it's not worth the effort. I can relate to that because, as mentioned above, nothing is worse than making decisions which you think are data-driven but which actually are not. Nonetheless, I recommend that you do use split testing to test potential improvements to your conversion funnel – just know the limitations and be very diligent when you draw conclusions.

What do you think? Have you already fallen prey to (or seen other people fall prey to) one of the fallacies above? Let me know!



11 comments:

Nikos Moraitakis said...

Hi Christoph - great post, all of the above ring true. Apparently I'm a card-carrying member of the "opinionated" club :)


One thing to note is that people who fall into the "don't base your design decisions on A/B testing" mindset are often not dismissing A/B testing as a design tool, but rather postponing it. A/B testing is a great tool for learning something from an audience or homing in on an optimal setup, but unless you already have volumes of data points and clear product-market fit, it's not very informative. So the "opinionated" argument is not to be opinionated (in other words, blind) forever, but to first get the product to volumes that will let A/B testing actually work and tell you something.


What often happens (I know I suffer from this) is that by refusing to test for too long, testing never becomes part of your company culture, and when you get to the point where it would be valuable, you haven't built the tools and the habit for it.


In your experience, what are a few examples of things worth testing even in an early-stage SaaS that doesn't have huge click volumes? If someone were to at least test out 2-3 things, where should they start?

Georgi Kadrev said...

Great post to clarify typical misconceptions. A/B is now as overused a term as MVP has been in the last few years – not that they're wrong, just overused IMHO :)


I'd say most people actually perform multivariate tests that they call A/B tests – probably not a bad thing, still. I'm interested to know what your experience with that is and whether you've tried both (separately). It's quite traffic-greedy to A/B test each single component, and the components also interfere with each other somewhat, so it'd be useful to know your attitude toward the granularity of testing.

Zac Aghion said...

What we’ve found is that a successful approach to A/B testing really depends on who you are as a company and what you're selling. Small 'shallow' changes produce equally small results, and so require relatively greater levels of statistical power to achieve significance. In an information economy where data is power, few startups have access to the sample sizes required to be successful with this approach.

For mega-traffic companies like Google or Amazon, these kinds of tests are worth it because a sub-1% lift in performance can be scaled out and have a substantial impact on their bottom line.

But for everyone else, ‘shallow’ A/B tests will often yield inconclusive results at best. We’ve seen that deeper changes to the product, UI layouts or entire UX workflows are what really move the needle.

Designing such tests requires more thought and development work up-front – but at least you’ll be making worthwhile improvements in an experimentally rigorous way instead of just spinning your wheels with some one-off design tweaks. You should be designing these tests with empathy for your audience (note: I've written more on empathic A/B testing here: http://splitforce.com/blog/empathic-ab-testing/). Ask the questions: What changes can I make to my product that would motivate my users to take the actions I want them to take? What are they looking for? What do they care about? And why? More often than not, I think you’ll find that the answer is not ‘a different button color’...

There are other ways to design an experiment and get a higher Return On Data (ROD). One way is to set a Minimum Detectable Effect, or minimum lift that you want to achieve from a test. Validating a small improvement requires more statistical power than validating a large one, and at some point, in order to justify continuing the test, you'll want to see some minimum amount of lift. Once you can say with statistical confidence that this desired lift isn't achievable, you can stop the test early and move on.
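
For illustration, here's a rough sketch of the sample-size side of that argument – the smaller the minimum detectable effect, the more visitors you need per variation (a standard two-proportion approximation at 5% significance and 80% power; the 3% baseline conversion rate is made up):

```python
import math

# z-scores for a two-sided 5% significance level and 80% power
Z_ALPHA = 1.96
Z_BETA = 0.84

def visitors_per_variation(baseline_rate, relative_mde):
    """Approximate sample size per variation to detect a relative lift
    of `relative_mde` over `baseline_rate` (standard two-proportion formula)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    pooled = (p1 + p2) / 2
    numerator = (Z_ALPHA * math.sqrt(2 * pooled * (1 - pooled))
                 + Z_BETA * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# The smaller the lift you want to detect, the more traffic you need:
for mde in (0.05, 0.10, 0.25):   # 5%, 10%, 25% relative lift on a 3% baseline
    print(f"MDE {mde:.0%}: ~{visitors_per_variation(0.03, mde):,} visitors per variation")
```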

Another way is to break away from the 'fixed-proportions' model of A/B, split, or multivariate testing. The problem with these techniques is that, for the duration of the 'exploration' or testing phase, a fixed proportion of your traffic is being exposed to something that is worse – and you don't know which variation that is. But with each additional piece of information that you collect, the picture becomes clearer, and you can use that incremental clarity to get a higher ROD.

We’ve done a lot of research into this more dynamic approach, and have found that using an unsupervised learning algorithm almost always leads to faster results and higher average conversion rates. You can read more about how it works here: http://splitforce.com/resources/auto-optimization/.
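
To illustrate the general idea of dynamic allocation – a Thompson-sampling-style sketch of the concept, not necessarily the specific algorithm linked above – instead of fixed 50/50 proportions, each visitor is routed to the variation that looks best given the data collected so far, with exploration built in:

```python
import random

# Observed conversions/misses per variation so far (hypothetical running totals)
stats = {"A": {"conversions": 30, "misses": 970},
         "B": {"conversions": 45, "misses": 955}}

def pick_variation():
    """Thompson sampling: draw a plausible conversion rate for each variation
    from a Beta posterior and route the visitor to the highest draw."""
    draws = {name: random.betavariate(s["conversions"] + 1, s["misses"] + 1)
             for name, s in stats.items()}
    return max(draws, key=draws.get)

def record_outcome(name, converted):
    """Update the running totals after we observe what the visitor did."""
    key = "conversions" if converted else "misses"
    stats[name][key] += 1

# For each incoming visitor:
variation = pick_variation()
# ...serve `variation`, then later:
record_outcome(variation, converted=False)
```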

Zac Aghion said...

Hi Georgi - A/B is certainly becoming a popular term and a misnomer for 'experimentation in general'. Interest in the term 'A/B testing', as measured by the proportion of searches on Google, has increased 500% over the past 5 years! See here: http://www.google.com/trends/explore#q=a%2Fb%20testing&date=6%2F2009%2063m&cmpt=q


And yet, from what we've seen, about 9 out of 10 people running 'A/B tests' do not fully understand the implications of their chosen experimental design or the statistics underlying hypothesis testing more generally. The number one problem we see is designing tests around small 'shallow' changes - usually cosmetic changes to the UI or copy. The fact is that, more often than not, small changes yield small improvements that do not justify the implicit or explicit costs of testing.


The truth is, an A/B test in its strictest definition is actually a very poor way to answer the question 'What works better?' We've found that a more sophisticated approach that generates a higher Return on Data is one that leverages recent advances in unsupervised learning and distributed computing. We've done a lot of research on this topic in particular, and have published it to share with the public here: http://splitforce.com/resources/auto-optimization/.

chrija said...

Many thanks for your comment, Nikos! I agree, the importance of A/B testing is different in different phases, and it tends to grow over time. No disagreement here! :)

On your question, I think some of the things (not all of them design-related) worth testing even at an early stage are:

- Pricing
- Pricing pages
- Signup pages
- Lifecycle emails and "customer advocate" check-ins
- Home/landing page (interesting one from 37signals: http://signalvnoise.com/posts/2991-behind-the-scenes-ab-testing-part-3-final)

zackliscio said...

Christoph,

As a company that makes A/B testing software for the enterprise, we've seen repeatedly that the potential reward of setting up a test has to outweigh the time and effort that goes into setting it up. People genuinely seem to want to test every aspect of their web presence, but they are strapped for time and lack direction.

We create multiple versions of how content is posted to social media when users organically share it, then optimize for clickthrough and a few other important metrics. For a high-volume website, there's no way that the time to set up experiments for every single piece of content can be justified. As a result, we ended up with a mix of human and automatic creation of variants, similar to the model Peter Thiel talks about for fraud mitigation at Paypal. For all content, we automatically create versions b, c, d etc., while highlighting the biggest opportunities for a human to create A/B tests with impact.

Zack Liscio, Naytev

Tristan Handy said...

Love this post. Many of the things you point out are things that our marketing org has learned over the past 18 months, so the post definitely resonates.

The one thing I'd somewhat take issue with is your point about how deep to go in the funnel when counting a "conversion". I don't disagree with your logic; rather, I think you're setting people up to be overwhelmed over a point that's really not that important.

Let's say you have a trial signup page with 5 fields vs one with 2 fields. Let's say that the one with 2 fields converts better. I am 100% comfortable making the assumption that this is an unambiguously good outcome. Why? Because I would be very surprised if the people who were motivated enough to sign up when the form had 2 fields were somehow not motivated enough to sign up when the barrier to entry was *lower*. So, I would make the simplifying assumption that any additional conversions were only increasing the size of the pie, not attracting a completely different set of users. Therefore, whatever incremental pie there was, the worst possible outcome is that none of these people converted into customers. More likely, however, is that there was at least *some* increase in customer count.

This is very different than, say, evaluating the efficacy of two completely different marketing channels based on top-of-funnel metrics. If you're going to evaluate adwords against facebook ads, you need to go all the way down the funnel, and into cohorts, to effectively answer this question. This is because the two populations are completely different and you can't make any simplifying assumptions about their relative performance.

Making well-reasoned assumptions is at the heart of good marketing analysis, because every single question we ask could turn into a clusterf**k of a research project if we let it. But spending too much time on analysis actually reduces the productivity of a lean team--it cuts into the time actually spent on accretive activities. This is where experience and judgment have a lot to add. I find this is where I am spending a ton of my time right now.

Chris Neumann said...

A couple of things on this: pretty much all of marketing is done on the margin, so you're sort of both right. It's good to get incremental customers into the pipeline, but as you attract incrementally less engaged customers, they will generally cost more to acquire and have higher demands. So you have to keep track of all of this and run the numbers to see what makes sense on the margin, given your corporate profit margin targets and LTV assumptions.

Chris Neumann said...

I've been doing conversion rate optimization on a consulting basis for a number of companies for several years now (see CROmetrics.com) and one thing I'll add to your list is that there generally isn't someone who owns the conversion rate process. It ends up being sort of a backwater of both product and marketing. I've worked for both product leaders and marketing leaders, and I generally find it's not on the job description of either role, so when it comes time to report, they have to prioritize the activities that they're being measured on, even if CRO has a very measurable ROI.


One other thing I've seen is the engineering team being resistant to it, and finding the changes made to the site disruptive, or thinking they're smarter than the marketing team, and taking aggressive action to disrupt the process. That's actually the sign of much bigger problems at the company though.


I do also run into the issues you mention above.

Thomas Roehm said...

In 2012 Volkswagen made an unbelievable net profit of €19BN. So the car industry may be a dinosaur, but it's a very professional one. Why do I talk about VW?

Because they do A/B testing. For the design of a new model, all car manufacturers usually test 2 different designs – an effort that costs €500k. Per test. And now the interesting news: the result of these extremely expensive tests DOESN'T determine which design will be shipped. That decision remains a gut decision of the Design Director and the CEO.

WHY?

There's undoubtedly a ton of things to learn from your customers, but certainly not how the future is shaped. Or, in the words of Steve Jobs: if you ask the client, he will say, "I want the same, just better, faster, cheaper." Personally I believe in data/information – sure, it's the basis of all decisions – but I'm convinced that (big) data is not the end of the story. Big data only looks back at a customer's life; predicting a customer's future decisions is 1000 times harder.

Why the hell does eBay always propose the stuff I've already bought, and never what I want to buy tomorrow???

chrija said...

Good points and agree!
