
camdavidsonpilon.github.io/2013/10/14/Multi-Armed-Bandits.html
Preview meta tags from the camdavidsonpilon.github.io website.
Linked Hostnames (8)
- 3 links to camdavidsonpilon.github.io
- 2 links to github.com
- 1 link to camdp.com
- 1 link to en.wikipedia.org
- 1 link to mstdn.ca
- 1 link to news.ycombinator.com
- 1 link to tdunning.blogspot.ca
- 1 link to www.reddit.com
Search Engine Appearance
Multi-Armed Bandits
Preface: This example is a (greatly modified) excerpt from the open-source book Bayesian Methods for Hackers, currently being developed on Github. Adapted from an example by Ted Dunning of MapR Technologies.

The Multi-Armed Bandit Problem

Suppose you are faced with \(N\) slot machines (colourfully called multi-armed bandits). Each bandit has an unknown probability of distributing a prize (assume for now the prizes are the same for each bandit; only the probabilities differ). Some bandits are very generous, others not so much. Of course, you don't know what these probabilities are. By choosing only one bandit per round, our task is to devise a strategy to maximize our winnings.

Of course, if we knew the bandit with the largest probability, then always picking this bandit would yield the maximum winnings. So our task can be phrased as "Find the best bandit, and as quickly as possible."

The task is complicated by the stochastic nature of the bandits. A suboptimal bandit can return many winnings, purely by chance, which would make us believe that it is a very profitable bandit. Similarly, the best bandit can return many duds. Should we keep trying losers then, or give up? A more troublesome problem is: if we have found a bandit that returns pretty good results, do we keep drawing from it to maintain our pretty good score, or do we try other bandits in hopes of finding an even better one? This is the exploration vs. exploitation dilemma.

Applications

The Multi-Armed Bandit problem at first seems very artificial, something only a mathematician would love, but that is only before we address some applications:
- Internet display advertising: companies have a suite of potential ads they can display to visitors, but the company is not sure which ad strategy to follow to maximize sales. This is similar to A/B testing, but has the added advantage of naturally minimizing strategies that do not work (and generalizes to A/B/C/D... strategies).
- Ecology: animals have a finite amount of energy to expend, and following certain behaviours has uncertain rewards. How does the animal maximize its fitness?
- Finance: which stock option gives the highest return under time-varying return profiles?
- Clinical trials: a researcher would like to find the best treatment, out of many possible treatments, while minimizing losses.

Many of these questions are fundamental to the application's field. It turns out the optimal solution is incredibly difficult to find, and it took decades for an overall solution to develop. There are also many approximately-optimal solutions which are quite good. The one I wish to discuss is one of the few solutions that can scale incredibly well. The solution is known as Bayesian Bandits.

A Proposed Solution

Any proposed strategy is called an online algorithm (not in the internet sense, but in the continuously-being-updated sense), and more specifically a reinforcement learning algorithm. The algorithm starts in an ignorant state, where it knows nothing, and begins to acquire data by testing the system. As it acquires data and results, it learns what the best and worst behaviours are (in this case, it learns which bandit is the best). With this in mind, perhaps we can add an additional application of the Multi-Armed Bandit problem:
- Psychology: how do punishment and reward affect our behaviour? How do humans learn?

The Bayesian solution begins by assuming priors on the probability of winning for each bandit. In our vignette we assumed complete ignorance of these probabilities, so a very natural prior is the flat prior over 0 to 1. The algorithm proceeds as follows. For each round:

1. Sample a random variable \(X_b\) from the prior of bandit \(b\), for all \(b\).
2. Select the bandit with the largest sample, i.e. select bandit \(B = \text{argmax}\;\; X_b\).
3. Observe the result of pulling bandit \(B\), and update your prior on bandit \(B\).
4. Return to 1.

That's it. Computationally, the algorithm involves sampling from \(N\) distributions. Since the initial priors are \(\text{Beta}(\alpha = 1, \beta = 1)\) (a uniform distribution), and the observed result \(X\) (a win or loss, encoded 1 and 0 respectively) is Binomial, the posterior is a \(\text{Beta}(\alpha = 1 + X, \beta = 1 + 1 - X)\) (see here for why this is true). To answer a question from before, this algorithm suggests that we should not discard losers, but we should pick them at a decreasing rate as we gather confidence that there exist better bandits. This follows because there is always a non-zero chance that a loser will achieve the status of \(B\), but the probability of this event decreases as we play more rounds (see figure below).

Below is an implementation of the Bayesian Bandits strategy (which can be skipped by the less Pythonically inclined).
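A minimal sketch of the strategy in Python, assuming NumPy, Bernoulli (win/loss) rewards, and Beta(1, 1) priors; the BayesianBandit class and its method names here are illustrative, not the post's original code.

```python
import numpy as np


class BayesianBandit:
    """Thompson-sampling sketch: a Beta(1, 1) prior over each bandit's
    win probability, updated after every Bernoulli pull."""

    def __init__(self, true_probabilities, rng=None):
        self.p = np.asarray(true_probabilities)   # hidden win probabilities
        self.wins = np.zeros(len(self.p))         # observed wins per bandit
        self.trials = np.zeros(len(self.p))       # pulls per bandit
        self.rng = rng or np.random.default_rng()

    def pull(self, b):
        # Simulate one pull of bandit b: 1 with probability p[b], else 0.
        return int(self.rng.random() < self.p[b])

    def sample_bandits(self, n_rounds):
        for _ in range(n_rounds):
            # 1. Sample X_b from each bandit's current Beta posterior.
            samples = self.rng.beta(1 + self.wins, 1 + self.trials - self.wins)
            # 2. Select the bandit with the largest sample.
            b = int(np.argmax(samples))
            # 3. Observe the result and update that bandit's posterior:
            #    Beta(1 + wins, 1 + losses) is the conjugate update for a
            #    win/loss observation under a Beta(1, 1) prior.
            x = self.pull(b)
            self.wins[b] += x
            self.trials[b] += 1
        return self.wins, self.trials


# Example run with the hidden probabilities used in the figures below.
bandits = BayesianBandit([0.85, 0.60, 0.75])
wins, trials = bandits.sample_bandits(1000)
print("pulls per bandit:", trials)
print("empirical win rates:", wins / np.maximum(trials, 1))
```

Each round costs only \(N\) Beta draws and one observation, which is why the strategy scales so well.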
Below we present a visualization of the algorithm sequentially learning the solution. In the figure below, the dashed lines represent the true hidden probabilities, which are [0.85, 0.60, 0.75] (this can be extended to many more dimensions, but the figure suffers, so I kept it at 3). Note that we don't really care how accurate our inference of the hidden probabilities becomes -- for this problem we are more interested in choosing the best bandit (or, more accurately, becoming more confident in choosing the best bandit). For this reason, the distribution of the red bandit is very wide (representing ignorance about what that hidden probability might be), but we are reasonably confident that it is not the best, so the algorithm chooses to ignore it.

Below is a D3 demonstration of this updating/learning process. The challenge is only three arms, each with a small probability (randomly generated). The first figure is the pull/reward count, and the second figure contains the posterior distributions of each bandit.
General Meta Tags (8)
- title: Multi-Armed Bandits | Data Origami
- charset: utf-8
- X-UA-Compatible: IE=edge
- viewport: width=device-width, initial-scale=1
- generator: Jekyll v3.9.5
Open Graph Meta Tags (6)
- og:title: Multi-Armed Bandits
- og:locale: en_US
- og:description: Preface: This example is a (greatly modified) excerpt from the open-source book Bayesian Methods for Hackers, currently being developed on Github …
- og:url: https://dataorigami.net/2013/10/14/Multi-Armed-Bandits.html
- og:site_name: Data Origami
Twitter Meta Tags (1)
- twitter:card: summary
Link Tags (3)
- alternate: https://dataorigami.net/feed.xml
- canonical: https://dataorigami.net/2013/10/14/Multi-Armed-Bandits.html
- stylesheet: /assets/main.css
Emails (1)
Links (11)
- http://camdp.com/blogs/how-sort-comments-intelligently-reddit-and-hacker-
- http://en.wikipedia.org/wiki/Conjugate_prior#Discrete_distributions
- http://tdunning.blogspot.ca/2012/02/bayesian-bandits.html
- http://www.reddit.com/r/MachineLearning/comments/1btja6/21st_century_problems_multiarmed_bandits
- https://camdavidsonpilon.github.io