A/B and Qualitative User Testing

| 05.21.2009

Recently, I worked with a company devoted to A/B testing. For those of you who aren’t familiar with the practice, A/B testing (sometimes called bucket testing or multivariate testing) is the practice of creating multiple versions of a screen or feature and showing each version to a different set of users in production in order to find out which version produces better metrics. These tests answer questions like “which version of a new feature makes the company more money” or “which landing screen positively affects conversion.” Overall, the goal of A/B testing is to let you make better product decisions about the things that matter to your business by using statistically significant data.
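To make that concrete, here is a minimal sketch (in Python, with invented names and variant labels) of the bucketing half of the equation: hash each user id so the same person always lands in the same version.

```python
import hashlib

def assign_variant(user_id: str, variants=("A", "B")) -> str:
    """Deterministically assign a user to a test variant.

    Hashing the user id (instead of picking randomly on every request)
    keeps each user in the same bucket across visits, so the metrics
    for each version come from a stable group of people.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# The same user always sees the same version:
print(assign_variant("user-1234"))
print(assign_variant("user-1234"))  # identical to the line above
```

In practice you would also salt the hash with the name of the experiment, so that different tests split users independently of one another.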

Qualitative user testing, on the other hand, involves showing a product or prototype to a small number of people while observing and interviewing them. It produces a different sort of information, but the goal is still to help you make better product decisions based on user feedback.

Now, a big part of my job involves talking to users about products in qualitative tests, so you might imagine that I would hate A/B testing. After all, wouldn’t something like that put somebody like me out of a job? Absolutely not! I love A/B testing. It’s a phenomenal tool for making decisions about products. It is not the only tool, however. In fact, qualitative user research combined with A/B testing creates the most powerful system for informing design that I have ever seen. If you’re not doing it yet, you probably should be.

A/B Testing

What It Does Well

A/B testing on its own is fantastic for certain things. It can help you:

  • Get statistically significant data on whether a proposed new feature or change significantly increases metrics that matter – numbers like revenue, retention, and customer acquisition
  • Understand more about what your customers are actually doing on your site
  • Make decisions about which features to cut and which to improve
  • Validate design decisions
  • See which small changes have surprisingly large effects on metrics
  • Get user feedback without actually interacting with users

For example, imagine that you are creating a new check out flow for your website. There is a request from your marketing department to include an extra screen that asks users for some demographic information. However, you feel that every additional step in a check out process represents a chance for users to drop out, which prevents purchases. By creating two flows in production, one with the extra screen and one without, and showing each flow to only half of your users, you can gather real data on how many purchases are completed by members of each group. This allows you to understand the exact impact on sales and helps you decide whether gathering the demographic information is really worth the cost.
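To decide whether the gap between the two groups is real rather than noise, a standard two-proportion z-test is enough. Here is a rough sketch; the function name and all of the counts are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def conversion_z_test(purchases_a, visitors_a, purchases_b, visitors_b):
    """Compare completion rates of two checkout flows with a
    two-sided, two-proportion z-test."""
    p_a = purchases_a / visitors_a
    p_b = purchases_b / visitors_b
    pooled = (purchases_a + purchases_b) / (visitors_a + visitors_b)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_a, p_b, p_value

# Flow A: no extra marketing screen; Flow B: with it (made-up numbers)
print(conversion_z_test(purchases_a=430, visitors_a=5000,
                        purchases_b=380, visitors_b=5000))
```

A small p-value says a difference that large would rarely show up by chance, which is exactly the “statistically significant data” part of the pitch.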

Even more appealing, you can get all this user feedback without ever talking to a single user. A/B testing is, by its nature, an engineering solution to a product design problem, which makes it very popular with small, engineering-driven startups. Once the various versions of the feature are released to users, almost anybody can look at the results and understand which option is doing better, so it can all be done without having to recruit or interview test participants.

Of course, A/B testing in production works best on things like web or mobile applications where you can not only show different interfaces to different customers, but where you can also easily switch all of your users to the winning interface without having to ship them a new box full of software or a new physical device. I wouldn’t recommend trying it if you’re designing, for example, a car.

What It Does Poorly

Now imagine that, instead of adding a single screen to an already existing check out flow, you are tasked with designing an entirely new check out flow that should maximize revenue and minimize the number of people who abandon their shopping carts. In creating the new flow, there are hundreds of design decisions you need to make, both small and large. How many screens should it have? How much up-selling and cross-selling should you do? At what point in the flow do you ask users for payment information? What should the screens look like? Should they have the standard header and footer, or should those be removed to minimize potential distractions for users when purchasing? And on and on and on…

These are all just a series of small decisions, so, in an ideal world, you’d be able to A/B test each one separately, right? Of course, in the real world, this could mean creating an A/B test with hundreds of different variations, each of which has to be shown to enough users to achieve statistical significance. Since you want to roll out your new check out process sometime before the next century, this may not be a particularly appealing option.
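To put a rough number on why that is painful, here is a back-of-the-envelope sketch. The baseline conversion rate, the lift you hope to detect, and the count of two hundred variations are all invented for illustration.

```python
from math import sqrt, ceil
from statistics import NormalDist

def sample_size_per_variant(p_base, lift, alpha=0.05, power=0.8):
    """Approximate visitors needed per variant to detect an absolute
    `lift` over a baseline conversion rate `p_base` (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_new = p_base + lift
    p_bar = (p_base + p_new) / 2
    n = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * sqrt(p_base * (1 - p_base) + p_new * (1 - p_new))) ** 2
    return ceil(n / lift ** 2)

per_variant = sample_size_per_variant(p_base=0.08, lift=0.01)
print(per_variant)        # roughly 12,000 visitors per variant
print(per_variant * 200)  # times 200 variations: about 2.4 million visitors
```

Even with generous assumptions, every variation needs on the order of ten thousand visitors before the numbers mean anything, and the totals multiply with every variation you add.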

A Bad Solution

Another option would be to fully implement several very different directions for the check out screens and test them all against one another. For example, let’s say you implemented four different check out processes with the following features to test against one another:

Option 1:
  • Yellow Background
  • Three Screens
  • Marketing Questions
  • No Up-selling
  • No Cross-Selling
  • Header
  • No Footer
  • Help Link

Option 2:
  • Blue Background
  • Two Screens
  • No Marketing Questions
  • Up-selling
  • No Cross-Selling
  • Header
  • Footer
  • No Help

Option 3:
  • Orange Background
  • Four Screens
  • Marketing Questions
  • Up-selling
  • Cross-Selling
  • No Header
  • Footer
  • Live Chat Help

Option 4:
  • White Background
  • One Screen
  • No Marketing Questions
  • No Up-selling
  • Cross-Selling
  • No Header
  • No Footer
  • Live Chat Help

This might work in companies that have lots of bored engineers sitting around waiting to implement and test several different versions of the same code, most of which will eventually be thrown away. Frankly, I haven’t run across a lot of those companies. But even if you did decide to devote the resources to building four different check out flows, the big problem is that, if you get a clear winner, you really don’t have a very clear idea of WHY users preferred a particular version of the check out flow over the others. Sure, you can make educated guesses. Perhaps it was the particularly soothing shade of blue. Or maybe it was the fact that there weren’t any marketing questions. Or maybe it was the aggressive up-selling. Or maybe that version just had the fewest bugs.

But the fact is, unless you figure out exactly which parts users actually liked and which they didn’t like, it’s impossible to know that you’re really maximizing your revenue. It’s also impossible to use those data to improve other parts of your site. After all, what if people HATE the soothing shade of blue, but they like everything else about the new check out process? Think of all the money you’ll lose by not going with the yellow or orange or white. Think of all the time you’ll waste by making everything else on your site that particular shade of blue, since you think that you’ve statistically proven that people love it!

What Qualitative Testing Does Well

Despite the many wonderful things about A/B testing, there are a few things that qualitative testing just does better.

Find the Best of All Worlds

Qualitative testing allows you to test wildly different versions of a feature against one another and understand what works best about each of them, thereby helping you develop a solution that has the best parts from all the different options. This is especially useful when designing complicated features that require many individual decisions, any one of which might have a significant impact on metrics. By observing users interacting with the different versions, you can begin to understand the pros and cons of each small piece of the design without having to run each one individually in its own A/B test.

Find Out WHY Users Are Leaving

While a good A/B test (or plain old analytics) can tell you which page a user is on when they abandon a check out flow, it can’t tell you why they left. Did they get confused? Bored? Stuck? Distracted? Information like that helps you make better decisions about what exactly it is on the page that is causing people to leave, and watching people use your feature is the best way to gather that information.

Save Engineering Time and Iterate Faster

Generally, qualitative tests are run with rich, interactive wireframes rather than fully designed and tested code. This means that, instead of having your engineers code and test four different versions of the flow, you can have a designer create four different HTML prototypes in a fraction of the time. HTML prototypes are significantly faster to produce since:

  • They don’t have to run in multiple browsers, just the one you’re testing
  • They don’t require any backend code to be written
  • They frequently don’t have a polished visual design (unless that’s part of what you’re testing)

And since making changes to a prototype doesn’t require any engineering or QA time, you can iterate on the design much faster, allowing you to refine the design in hours or days rather than weeks or months.

How Do They Work Together?

Qualitative Testing Narrows Down What You Need to A/B Test

Qualitative testing will let you eliminate the obviously confusing stuff, confirm the obviously good stuff, and narrow down the set of features you want to A/B test to a more manageable size. There will still be questions that are best answered by statistics, but there will be a lot fewer of them.

Qualitative Testing Generates New Ideas for Features and Designs

While A/B testing helps you eliminate features or designs that clearly aren’t working, it can’t give you new ideas. Users can. If every user you interview gets stuck in the same place, you’ve identified a new problem to solve. If users are unenthusiastic about a particular feature, you can explore what’s missing with them and let them suggest ways to make the product more engaging.

Talking to your users allows you to create a hypothesis that you can then validate with an A/B test. For example, maybe all of the users you interviewed about your check out flow got stuck selecting a shipment method. To address this, you might come up with ideas for a couple of new shipment flows that you can test in production once you’ve confirmed that they’re less confusing with another quick qualitative test.

A/B Testing Creates a Feedback Loop for Researchers

A/B tests can also improve your qualitative testing process by providing statistical feedback to your researchers. I, as a researcher, am going to observe participants during tests in order to see what they like and dislike. I’m then going to make some educated guesses about how to improve the product based on my observations. When I get feedback about which recommendations are the most successful, it helps me learn more about what’s important to users so I make better recommendations in the future.

Any Final Words?

Separately, both A/B testing and qualitative testing are great ways to learn more about your users and how they interact with your product. Combined, they are more than the sum of their parts. They form an incredibly powerful tool that can help you make good, user-centered product decisions more quickly and with more confidence than you have ever imagined.

Quicken Picks Redesign Launched

| 05.20.2009

Our redesign of the Quicken Picks website has just launched!

New home page

Key features of our redesign effort:

  • Three step home page for first time users: First time user experience clearly explains the benefits of cash back
  • Deal badges: Deals are now presented in a consistent, streamlined manner
  • Clean information architecture: Flattened navigation presents primary store categories on the home page
  • Simple deal flow: Users can now get from a deal on Quicken Picks to the target site in one click

Here’s our three step design on the home page (it follows a design pattern we’ve blogged about before):

[Screenshot: three step home page design]

Before our redesign

Here are some screenshots of the site before our redesign to give you a sense of the impact of our work. It’s a bit green and gunky to say the least:

Home page – Logged in

Home page – First time visitor

Feel free to check out the redesign yourself at www.quickenpicks.com.

Balsamiq vs. HTML Wireframes

| 05.04.2009

Recently, I discovered a new prototyping tool for creating rough, sketch-style UI designs called Balsamiq Mockups. It’s basically a lo-fi mockup tool with a built-in library of sketch-style UI elements that can be easily dropped onto a workspace and edited.

After test-driving this tool on my own, I decided to see how Balsamiq Mockups sketches compared to rough HTML wireframes in the context of a user study.

First impressions using Balsamiq

After demoing the tool for only a few minutes, I thought, “Wow, nice job!” It was fun to play with, drop-dead simple to use, and required no tedious application tutorials or programming knowledge to just dig right in. I liked how quick and easy it was to drag and drop library items into the workspace and adjust and rearrange them as needed. After only a short time using Balsamiq, I could create simple mockups at least as fast as (if not faster than) I could in Dreamweaver, the tool I use most frequently for rapid prototyping. I also found it very easy to export my Balsamiq sketches as static files and add interactive behaviors on top of them using hotspots.

However, the real strength of Balsamiq comes from the fact that the tool does not simply emulate other prototyping tools that make realistic-looking mockups (such as Visio, PowerPoint, GUI Design Studio, and MockupScreens). Instead, Balsamiq takes an entirely different approach and deliberately uses a rough, hand-drawn look and feel, presumably to better communicate that these designs are early ideas in progress and are not fully fleshed out yet. I hypothesized that Balsamiq Mockups would be especially good for communicating early design concepts, keeping users focused on high-level ideas rather than on details.

Balsamiq vs. HTML usability test

To test my hypothesis, I conducted a comparison study to find out if the sketch-like quality of Balsamiq mockups had any effect on the type of feedback that users provide in the context of a study.

The research process

For the study, I used an A/B test format. First, I mocked up the same high-level design using two different methods: version one was created in Balsamiq, and version two was a very rough, grayscale HTML wireframe created in Dreamweaver. Then, I showed half the participants the version one Balsamiq sketch; the other half of the participants saw the version two HTML wireframe. (See the images below, with text blocked out for client IP reasons.)

Version 1: Balsamiq sketch

Version 2: HTML wireframe

In both cases, I told users to share their initial impressions about the design. I explained that the designs were still a work-in-progress and I was looking for high-level feedback regarding the concept and overall functionality, rather than the specific language, details, or look and feel of the page. I then noted their feedback to see if there were differences between the two groups.

The results

All participants were able to provide high-level feedback about both types of mockups. However, participants who saw the Balsamiq sketches were slightly less concerned with the details and specifics of the design. They tended to comment less on the colors and appearance of the design, as well as the particular language and text used. They were also more aware of the fact that they were looking at an in-progress design, making comments such as, “I don’t really like the way this dropdown menu works, but I know the design for it might change later anyway.”

In both cases, users still made comments on specific page interactions, but the Balsamiq participants were more likely to be aware that this type of feedback was overly specific, as one participant qualified his feedback by saying, “I guess that’s not really the kind of feedback you’re looking for right now.”

Overall, I found that the participants who saw the Balsamiq sketches did a slightly better job of keeping their feedback conceptual than the participants who saw the HTML wireframes. However, participants in both cases still commented on some of the details of the mockups, and it was necessary to use moderator questions to guide users’ focus and attention back to the level of feedback I was looking for.

Conclusion

Balsamiq Mockups is a useful new prototyping tool. It’s perfect for sketching quick ideas and sharing with others, especially if the mockups are fairly simple in layout and overall complexity. It’s also a great tool for concept testing with users, as the sketch-like quality of the designs works well for gathering high-level feedback on initial design ideas.