Multi-armed bandit

Multi-armed bandit (MAB) is a form of machine learning that uses algorithms to automatically optimize the allocation of traffic among different variations of an experience, based on how each variation is performing. The variations are the eponymous multiple “arms”. The algorithms are called bandit algorithms after the one-armed bandit slot machines of Las Vegas.

What problem does the multi-armed bandit solve?

The problem solved by the multi-armed bandit (MAB) mirrors the situation of a gambler who, faced with many one-armed bandit slot machines with different and unknown payouts, must trade off exploiting the best-paying machine they’ve found so far against exploring other machines that may offer better payouts.

More generally, the MAB algorithm may be used to automatically manage any kind of exploration/exploitation tradeoff.
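One of the simplest ways to manage this tradeoff is the epsilon-greedy strategy: with a small probability, explore a random arm; otherwise, exploit the arm with the best observed average payout. A minimal sketch, with illustrative payout probabilities standing in for the unknown machines:

```python
import random

def epsilon_greedy(payouts, n_rounds=10_000, epsilon=0.1):
    """Play a bandit with epsilon-greedy: explore a random arm with
    probability epsilon, otherwise exploit the best arm seen so far."""
    n_arms = len(payouts)
    pulls = [0] * n_arms      # times each arm was played
    rewards = [0.0] * n_arms  # total reward earned per arm
    total = 0.0
    for _ in range(n_rounds):
        if random.random() < epsilon or 0 in pulls:
            arm = random.randrange(n_arms)  # explore
        else:
            # exploit: pick the arm with the highest observed average reward
            arm = max(range(n_arms), key=lambda a: rewards[a] / pulls[a])
        reward = 1.0 if random.random() < payouts[arm] else 0.0
        pulls[arm] += 1
        rewards[arm] += reward
        total += reward
    return pulls, total

# Hypothetical machines with unknown (to the algorithm) payout probabilities
pulls, total = epsilon_greedy([0.03, 0.05, 0.08])
```

Over many rounds, the best-paying arm accumulates most of the pulls, while the fixed exploration rate keeps checking the others in case the estimates are wrong.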

How does the multi-armed bandit differ from A/B testing?

A/B/n testing tries to find the best of potentially many variations by splitting visitor traffic among them. These tests may use fixed horizons or, more commonly, adaptive designs with interim analyses.

Once an A/B test has concluded, if it isolates a statistically significant improvement over the control, that winning variation can be served to all future visitors.

A/B/n testing is an optimal approach for finding these winners. The drawback of this strategy is that once you find the winning variation, you’ll wish you’d been showing it to more visitors, sooner. A multi-armed bandit (MAB), by contrast, tries to use the best options as quickly and as often as possible, typically by abandoning suboptimal variations as soon as it is confident they are suboptimal. The assumption the MAB makes here is that we don’t need to measure exactly how suboptimal they are.

In short, MAB trades away some of the potential statistical significance and exploratory power of an A/B test for definite short-term gains in exploiting the value of the winning variation.
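The contrast can be sketched with Thompson sampling, one common bandit algorithm: instead of a fixed 50/50 split, each visitor is assigned by sampling from each variation’s posterior conversion rate, so traffic drifts toward the leader as evidence accumulates. The conversion rates below are purely illustrative:

```python
import random

def thompson_sampling(conv_rates, n_visitors=5000):
    """Assign each visitor via Thompson sampling: draw from each arm's
    Beta posterior over its conversion rate and show the arm with the
    highest draw. Better-performing arms quickly absorb most traffic."""
    n = len(conv_rates)
    alpha = [1] * n  # observed successes + 1 (uniform Beta(1,1) prior)
    beta = [1] * n   # observed failures + 1
    shown = [0] * n  # visitors shown each variation
    for _ in range(n_visitors):
        samples = [random.betavariate(alpha[a], beta[a]) for a in range(n)]
        arm = samples.index(max(samples))
        shown[arm] += 1
        if random.random() < conv_rates[arm]:  # simulate a conversion
            alpha[arm] += 1
        else:
            beta[arm] += 1
    return shown

shown = thompson_sampling([0.04, 0.06])
```

Unlike an even split, the allocation is adaptive: the losing variation still receives some traffic early on (exploration), but its share shrinks as the posteriors separate (exploitation).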

How does the multi-armed bandit apply to personalization?

A multi-armed bandit (MAB) can be used for personalization anywhere along a spectrum whose two extremes are “auto-optimization for a personalization” and “auto-personalization for CRO”. That is, a MAB can try to find the single best variation for a predefined audience of users you want to target, or it can make 1:1 personalized decisions about which variation to show each individual visitor.

In the first use case, the MAB is essentially a “smart A/B test” for the personalization you’re considering. Given an audience defined in terms of user attributes, environmental factors such as weather, segment membership, and similar kinds of personal context, it automatically shows that audience whatever it judges best, based on the behavior it has seen from other audience members. This is a good approach if you want the MAB to automatically optimize a uniform experience for a carefully curated, personalized audience.

In the second, true 1:1 personalization use case, the MAB makes personalized decisions to enhance its automatic optimization. It can use user attributes, environmental factors such as weather, segment membership, and similar kinds of personal context directly in its decision making, figuring out on its own which audiences respond to which variations. This is a good approach if you want the MAB to automatically personalize an experience for a mass audience.
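The 1:1 case can be sketched as a contextual bandit. A minimal (and deliberately naive) version keeps separate reward estimates per (context, variation) pair, so the decision depends on the visitor’s attributes; the contexts and conversion rates below are hypothetical:

```python
import random
from collections import defaultdict

def contextual_epsilon_greedy(visitors, conv, epsilon=0.1):
    """Per-context epsilon-greedy over two variations: learn which
    variation works best for each visitor context independently."""
    pulls = defaultdict(lambda: [0, 0])      # context -> pulls per arm
    rewards = defaultdict(lambda: [0.0, 0.0])  # context -> reward per arm
    for ctx in visitors:
        p, r = pulls[ctx], rewards[ctx]
        if random.random() < epsilon or 0 in p:
            arm = random.randrange(2)  # explore
        else:
            # exploit the better arm for this specific context
            arm = 0 if r[0] / p[0] >= r[1] / p[1] else 1
        p[arm] += 1
        if random.random() < conv[ctx][arm]:  # simulate a conversion
            r[arm] += 1.0
    return pulls

# Hypothetical behavior: mobile visitors convert better on variation B
# (arm 1), desktop visitors on variation A (arm 0).
conv = {"mobile": (0.02, 0.08), "desktop": (0.07, 0.03)}
visitors = [random.choice(["mobile", "desktop"]) for _ in range(10_000)]
pulls = contextual_epsilon_greedy(visitors, conv)
```

Because the estimates are kept per context, the bandit ends up routing mobile and desktop visitors to different variations without anyone predefining those audiences. Real contextual bandits generalize across contexts with a model rather than a lookup table, but the decision structure is the same.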