A/B testing is a powerful part of the UX design process that provides unique insights into how design changes affect user behavior. Out of a few redesign concepts, you can pick the one that performs best. With this data in hand, you can optimize the efficiency of your design and bridge the gap between business and UX.
As a very sharp tool in a UX researcher’s toolbox, A/B testing needs to be treated carefully. There are a lot of potential pitfalls along the way. Here are some of the most common A/B testing mistakes made while using this popular research method—and tips for fixing them.
What Is A/B Testing?
But first, let’s cover the basics. Depending on your industry or even your department, A/B testing may go by a few different names: A/B/n testing, bucket testing, or split testing. The concept remains the same regardless of name: an A/B test is an experiment in which a few different designs (called variants) compete with each other. It is a crucial skill to master when learning UX.
A/B testing is widely used in e-commerce and advertising, where experiments help optimize click-through rates, revenue flows, and similar metrics. But the usefulness of the method isn’t limited to commerce. It can be used to test any suitable hypothesis and to optimize any quantifiable metric in the product (e.g., retention, feature discoverability, social interactions).
In other words, if you can formulate your problem using quantitative metrics, you can use A/B testing to solve it.
Different users see different variations of a web page. (Image source)
The idea is that all variants co-exist within the product at the same time, but they are shown to different users. Because each user is only exposed to one variant, the experience is consistent. After a few weeks of gathering and analyzing data, testers can identify the variant that performs better and conclude the experiment.
Scheme of a simple A/B test. (Image source)
After the test is complete, the winner is usually shown to all the users.
Each experiment consists of the following parts:
Goal
What do we want to achieve by adjusting our design? A conversion-rate improvement for our landing page, better discoverability of a new feature, a redesign: all of these are example areas where we search for a goal.
Hypothesis
Based on the goal, we formulate a hypothesis. “Our landing page conversion is not good enough because the call-to-action button isn’t visible” or “our new feature will be more discoverable if we add it to our onboarding screens.”
Variants
With the hypothesis at hand, it’s time to brainstorm variants. You often need only one, because your current design always competes with it to provide a baseline.
However, having two or three variants is even better. It might look like this: “let’s make our call-to-action button animate when the webpage is loaded” or “let’s show onboarding on the first launch” or “let’s show onboarding when a user first visits the screen that has this feature.”
Buckets of users
Once you have your variants, you need to distribute them to the users. Usually, the majority of the users (50 to 90 percent) will keep seeing the old design without changes. This is the control group, providing data for baseline performance.
Examples of possible distributions:
– 80 percent of users are in the control group, 20 percent get animated call-to-action button
– 70 percent of users are in the control group, 15 percent get text onboarding, 15 percent get video onboarding
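To make the bucket idea concrete, here is a minimal sketch of how a 70/15/15 split could be assigned. It assumes each user has a stable ID and hashes that ID, so the same user always sees the same variant. The function name `assign_bucket`, the experiment name, and the bucket names are hypothetical and chosen only for illustration; real A/B testing tools handle this step for you and may do it differently.

```python
import hashlib


def assign_bucket(user_id: str, experiment: str, weights: dict) -> str:
    """Deterministically map a user to a bucket according to the given weights."""
    total = sum(weights.values())
    # Hash the experiment name together with the user ID so that the same
    # user always lands in the same bucket for this experiment.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Turn the first 8 hex digits into a number in the range [0, 1).
    point = int(digest[:8], 16) / 0x100000000
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight / total
        if point < cumulative:
            return name
    return name  # guard against floating-point rounding at the upper edge


# Hypothetical split matching the second example above.
ONBOARDING_WEIGHTS = {"control": 70, "text_onboarding": 15, "video_onboarding": 15}
print(assign_bucket("user-42", "onboarding-test", ONBOARDING_WEIGHTS))
```

Because the assignment depends only on the user ID and the experiment name, no per-user state needs to be stored to keep the experience consistent.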
After all four parts of the process are ready, we can run the test by deploying the variants to our users and gathering the metrics. The recommended time span for an A/B test is three to six weeks, long enough that we can filter the actual results from noise.
If you’re tempted to tweak the timeline, consider the following:
- You need to have enough users who’ve seen your variants. Some mathematical formulas and tools, like Optimizely’s sample size calculator or this test calculator, can help calculate how many users are enough for your particular test.
- Usually, it is recommended to run tests for a few weeks, as people’s behavior might differ substantially on weekends vs. business days.
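For a rough feel of what such a calculation involves, here is a sketch using the textbook sample-size formula for comparing two conversion rates. It is not necessarily the method used by the calculators mentioned above. The 5 percent significance level, the 80 percent power, and the example rates are assumptions chosen for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def users_per_variant(baseline_rate: float, expected_rate: float,
                      alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed in each bucket to detect the given change."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    pooled = (baseline_rate + expected_rate) / 2
    numerator = (
        z_alpha * sqrt(2 * pooled * (1 - pooled))
        + z_beta * sqrt(baseline_rate * (1 - baseline_rate)
                        + expected_rate * (1 - expected_rate))
    ) ** 2
    return ceil(numerator / (expected_rate - baseline_rate) ** 2)


# Hypothetical example: a 10 percent conversion rate that we hope to lift to 12 percent.
print(users_per_variant(0.10, 0.12))
```

The smaller the change you want to detect, the more users each variant needs, which is one reason small buckets and short tests produce unreliable results.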
When the experiment is done, it’s time to interpret the results. Keep in mind that any result is good, regardless of whether the hypothesis is proven or disproven.
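One common way to check whether a difference between the control group and a variant is more than noise is a two-proportion z-test. The sketch below is only one possible approach, with made-up numbers; your testing tool may use a different statistical method.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(control_conversions: int, control_users: int,
                           variant_conversions: int, variant_users: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    p_control = control_conversions / control_users
    p_variant = variant_conversions / variant_users
    # Pooled rate under the assumption that both groups behave the same.
    pooled = (control_conversions + variant_conversions) / (control_users + variant_users)
    std_error = sqrt(pooled * (1 - pooled) * (1 / control_users + 1 / variant_users))
    z = (p_variant - p_control) / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical data: 1,000 of 10,000 control users converted vs. 1,200 of 10,000 variant users.
p_value = two_proportion_p_value(1000, 10000, 1200, 10000)
print(f"p-value: {p_value:.6f}")
```

A small p-value (0.05 is a common threshold) suggests the difference is unlikely to be random noise. A large one means the test could not tell the variants apart, which is still a useful result.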
Now, let’s see what could go wrong.
8 Typical Mistakes
Stopping tests too early
The first and arguably most common problem is when the execution time is too short. Many inexperienced testers are so anxious to see results that they conclude the test without sufficient data. But the shorter the test is, the less precise the results will be.
So, how long should you run the test? It depends on the audience size, number of variants, and number of users per variant. As discussed above, three to six weeks is usually a safe bet. Some services provide tools (for example: one, two, three) that, after you set the parameters of your product, calculate necessary exposure and experiment time.
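As a back-of-the-envelope sketch, the experiment time can be estimated from the number of users each variant needs, your daily traffic, and the share of traffic the variant receives. All of the numbers below are hypothetical. Rounding up to whole weeks is an assumption made here so that weekends and business days are represented equally.

```python
from math import ceil


def estimated_test_days(users_needed_per_variant: int, daily_users: int,
                        variant_share: float) -> int:
    """Estimate how many days a variant needs to reach its required exposure."""
    daily_users_in_variant = daily_users * variant_share
    days = ceil(users_needed_per_variant / daily_users_in_variant)
    # Round up to full weeks so every day of the week is covered equally.
    return ceil(days / 7) * 7


# Hypothetical product: 2,000 daily users, 10 percent of them in the variant,
# and 4,000 users needed per variant.
print(estimated_test_days(4000, 2000, 0.10))
```

If the estimate comes out far longer than six weeks, that is a signal to send more traffic to the variant or to test a bolder change.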
Sending too little traffic to variations
Running experiments on the majority of users is a real risk, as we don’t know whether our variations will perform better or worse than our current design. That said, it’s important not to be too cautious when allocating user buckets: a bucket that is too small collects too little data, so random noise can distort the results, just as shortening the test does.
The minimum percentage per variation will differ for each case, but a good rule of thumb is to put at least 5 percent of users in a variant.
Using too many variations
While it is tempting to test every variation you can think of, having too many can be problematic: each variation’s set of users may be too small or there might be multiple winners because the variations aren’t different enough.
Having three to five variations is usually a good choice. If you have more ideas than that, try grouping them.
Not taking seasonal changes into account
In theory, all days should be similar to one another. But in business practice, that’s usually not the case. For a ski equipment store, it’s natural that certain seasons are more active than others. Travel agencies, sports stores, gyms, and almost every other business experience these fluctuations at different levels: daily, weekly, monthly, or seasonally.
It is useful to know when these changes occur and to take them into account when planning and executing A/B tests.
This is another reason to keep your control group big enough (30+ percent); seasonal changes will be visible there too.
Running multiple A/B tests at the same time
It takes time to perform an A/B test, so testers sometimes consider running multiple tests at once to get results faster. The reality is that this will likely cause the tests to interfere with each other and skew the results.
It’s important to take your time and to ensure tests aren’t run at the same time. Often there are some interconnections that aren’t obvious. If you have to run experiments at the same time (say, AB and XY), make sure that users who see a variation of AB are in the control group of XY, and vice versa.
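Here is a minimal sketch of one way to keep two simultaneous experiments from overlapping. It assumes stable user IDs and carves the user base into non-overlapping slices, so nobody sees a variation from both AB and XY. The function names, the salt string, and the 20 percent shares are hypothetical.

```python
import hashlib


def _point(user_id: str, salt: str) -> float:
    """Map a user to a stable number in the range [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000


def assign_exclusive(user_id: str, ab_share: float = 0.2, xy_share: float = 0.2) -> dict:
    """Assign a user so that a variation in one test implies control in the other."""
    point = _point(user_id, "exclusive-layer")
    if point < ab_share:
        return {"AB": "variation", "XY": "control"}
    if point < ab_share + xy_share:
        return {"AB": "control", "XY": "variation"}
    # Everyone else sees the current design in both experiments.
    return {"AB": "control", "XY": "control"}


print(assign_exclusive("user-42"))
```

The trade-off is that each experiment gets a smaller share of the traffic, so both tests may need to run longer.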
Ignoring cultural differences
When the product or service is available internationally, it is very tempting to experiment in a smaller market.
The results are still a source of valuable information, but you should interpret them with caution. If we have two markets, X and Y, and our A/B test in Y shows better results for variation B, we might assume that it would perform better in X as well. That is not necessarily true! Social, cultural, and geographical factors might come into play, and the design that works perfectly in Y might fail in X.
Test your assumptions in the market where you are planning to launch the design. Testing some ideas first in a smaller market is valuable, but don’t extrapolate behaviors from there.
Using dark patterns to improve numbers
A/B testing is a quantitative research method that shows how efficient your user interface is at reaching a goal. Improving numbers might be a very difficult task, but it’s important not to cheat the user experience via shortcuts like “dark patterns.” These patterns just trick users into doing whatever we want them to do.
Consider this: If you remove the “unsubscribe” link from your emails, fewer people will unsubscribe for sure, but the data won’t be valid. Dark patterns, while efficient in the short-term, hurt the product long-term.
Don’t focus on the numbers alone. Make sure your designs not only perform well, but also feel great to use. Use other research types, like guerrilla testing and user interviewing, to check for these traits.
Not preparing quantified goals
Sometimes the goals of the A/B test aren’t specific enough. “We need to improve the performance of a sign-up flow” or “we need to make our checkout feature better,” for example.
How do these goals map to metrics? Is a conversion rate increase of 1 percent an “improved performance” or is that not enough? Are you even tracking the right metrics while a test is running?
Spend some time in advance and make the goals very specific and metrics-based:
- “We need to improve the performance of a sign-up flow” → “decrease bounce rate at the end of the sign-up flow by at least 2 percent”
- “We need to make the checkout feature better” → “we need to increase the percentage of users linking their credit card to our website by 1 percent”
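One way to keep a goal honest is to write it down in a form that can be checked against the test results. The sketch below assumes that “2 percent” and “1 percent” in the goals above mean percentage points, which the goals themselves don’t spell out; that ambiguity is worth resolving before the test starts. The `Goal` class and the example rates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Goal:
    metric: str
    direction: str            # "increase" or "decrease"
    min_change_points: float  # minimum change, in percentage points

    def is_met(self, control_rate: float, variant_rate: float) -> bool:
        """Check whether the variant moved the metric far enough in the right direction."""
        change_points = (variant_rate - control_rate) * 100
        if self.direction == "decrease":
            change_points = -change_points
        return change_points >= self.min_change_points


sign_up_goal = Goal("bounce rate at the end of the sign-up flow", "decrease", 2)
checkout_goal = Goal("users linking a credit card", "increase", 1)

# Hypothetical results: bounce rate fell from 40 to 37 percent,
# card linking rose from 20 to 21.5 percent.
print(sign_up_goal.is_met(0.40, 0.37))
print(checkout_goal.is_met(0.20, 0.215))
```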
A/B testing is an effective tool that helps collect real data from your users and allows you to check how design changes are performing. However, A/B testing also requires a lot of work to implement correctly. Most of the pitfalls we outlined here are the result of poor preparation or trying to take shortcuts. Patience and attention to detail are your best friends when running these tests.