Home Blog About Us Work we do Content Contact Us

Newsletter: Click here to join an email list and be contacted when a new article is published.




Multi Armed Bandit

Imagine you are standing in front of a row of slot machines, and wish to gamble. You have a bag full of coins. Your goal is to maximize the return on your investment. The problem is that you don’t know the payout percentages of any of the machines. Each has a, potentially, different expected return.

What is your strategy?

Image: Thomas Hawk


You could select one machine at random, and invest all your coins there, but what happens if you selected a poor payout machine? You could have done better.

You could spread your money out and divide it equally (or randomly) between all the different machines. However, if you did this, you’d spend some time investing in poorer payout machines and ‘wasting’ coins that could have be inserted into better machines. The benefit of this strategy, however, is diversification, and you’d be spreading your risk over many machines; you’re never going to be playing the best machine all the time, but you’re never going to be playing the worst all the time either!

Maybe a hybrid strategy is better? In a hybrid solution you could initially spend some time experimenting to estimate the payouts of the machines then, in an exploitation phase, you could put all your future investment into the best paying machine you’d discovered. The more you research, the more you learn about the machines (getting feedback on their individual payout percentages).

However, what is the optimal hybrid strategy? You could spend a long time researching the machines (increasing your confidence), and the longer you spend, certainly, the more accurate your prediction of the best machine would become. However, if you spend too long on research, you might not have many coins left to properly leverage this knowledge (and you’d have wasted many coins on lots of machines that are poor payers). Conversely, if you spend too short a time on research, your estimate for which is the best machine could be bogus (and if you are unlucky, you could become victim to a streak of ‘good-luck’ from a poor paying machine that tricks you into thinking it’s the best machine).

If you are playing a machine that is "good enough", is it worth the risk of attempting to see if another machine is "better" (experiments to determine this might not be worth the effort).


Maybe we could devise an adaptive system that changes? After all, after each pull, we gain information about the system. Also, what happens if, heaven forbid, the machines slowly change their payout percentages as we interact with them?

Multi Bandit Problems

These kind of problems are known, affectionately, by Data Scientists as the “Multi-Bandit Problems” (in reference to slot machines, which are also known as one-armed bandits).

The multi-bandit problem has many, many applications:

Various Strategies

There are many, heavily studied, algorithms for how to address the multi-bandit problem. Here is a high level look at a couple of them:

Epsilon-first strategy

In this strategy, the contest is broken into two phases: An exploration phase, follow by an exploitation phase. A value of epsilon ε is selected (e.g. ε=0.1), and for the first 10% of the time, a machine is selected entirely at random (uniform distribution of probability). After these trials are done, for the remaining trials (1-ε) the single machine that had the highest expected return from the exploration phase is used exclusively.

As you can see, this strategy uses a percentage of the expected pulls to determine the best machine to use, and for the rest of the pulls the best machine found is used.

Epsilon-Greedy Strategy

A modification of this is an adaptive approach. Instead of using a dedicated exploration phase, it is a continuous process. A value of epsilon ε is selected (e.g. ε=0.1), and for this percentage of trials, a random machine is selected. For the remaining 1-ε trials, the current leading machines is used.

In the above two strategies, if ε=0.0 then the algorithms are entirely greedy.

Epsilon-decreasing Strategy

A further hybrid of this is the epsilon-decreasing approach. Similar to Simulated annealing, the value of ε decreases over the experiment. Early on in the process, ε starts with a high value, so there is a good chance that a random exploration pull will be made. Over time, the value of ε decreases (typically as an exponential decay), so that a higher percentage of the later pulls are exploitation pulls and not an experimentation pulls.

There are more complex variants of the above based on using softmax weightings in the exploration phase.

Bayesian Bandits

Then there is a category of algorithms based on information on past pulls. Rather than selecting, digitally, the machine to pull, the handle of the next machine to pull could be based, probabilistically, on past performance. Yes, the “rich would get richer”, but there is still a chance that a less optimal machine might be selected (albeit with a lower probability), and if this turned out to be a good result, that it would get reinforcement, and thus more likely to be pulled in future rounds; gradually creeping up.

Secretary Puzzle

Readers of my blog might see an analogy between this problem and the Sultan's Wife problem (also known as the sectrary problem). The differences, however, are obvious that, for the secretary problem, you don't get chance to sample more than once, and you don't get the option of returning.


You can find a complete list of all the articles here.      Click here to receive email alerts on new articles.

© 2009-2017 DataGenetics    Privacy Policy