# Multi-Armed Bandit Problem

## What is the Multi-Armed Bandit Problem?

• The multi-armed bandit problem is a classic problem in reinforcement learning.
• A one-armed bandit is a slot machine in a casino.
• These machines take away your money very quickly. The chances of winning on a slot machine are very low.
• The probability of the player winning is less than the probability of the player losing.
• Assume that each slot machine in the casino has a different distribution of returns (that is, a different probability of winning).
• The player does not know these distributions in advance.
• When a player is playing more than one slot machine (e.g. gambling on 5 slot machines at the same time), we want to know how the player should play them to maximize the return.
• To find the best machine, the player has to explore all of them. But the longer the player spends exploring, the more money is wasted on the low-return slot machines.
• On the other hand, if you do not spend enough time exploring, your estimate of which machine is best may be unreliable.
• The goal of the multi-armed bandit problem is to find the slot machine with the maximum return as quickly as possible.
• This is the challenge that we are going to solve with two simple artificial intelligence methods:
• Upper Confidence Bound
• Thompson Sampling
• If a return of 1 denotes a win (a positive return), the goal is to find the slot machine whose return distribution has the mean closest to 1.
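
To make the setup concrete, here is a minimal Python sketch of the problem and the two methods named above. It is an illustration under stated assumptions, not a reference implementation: every machine is assumed to pay 1 on a win and 0 otherwise, the win probabilities in `hidden_probs` are made-up values, and the function names (`pull`, `run_ucb1`, `run_thompson`) are hypothetical. The Upper Confidence Bound part uses the common UCB1 rule, and the Thompson Sampling part assumes a uniform Beta(1, 1) prior on each machine's win probability.

```python
import math
import random


def pull(win_probs, arm, rng):
    """Play one machine: return 1 on a win, 0 otherwise."""
    return 1 if rng.random() < win_probs[arm] else 0


def run_ucb1(win_probs, rounds, rng):
    """UCB1: play the machine with the highest average return plus bonus."""
    n_arms = len(win_probs)
    counts = [0] * n_arms  # how many times each machine was played
    totals = [0] * n_arms  # total return collected from each machine
    for t in range(1, rounds + 1):
        if t <= n_arms:
            arm = t - 1  # play every machine once to initialise the estimates
        else:
            # Rarely played machines get a larger exploration bonus.
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        totals[arm] += pull(win_probs, arm, rng)
        counts[arm] += 1
    return counts


def run_thompson(win_probs, rounds, rng):
    """Thompson Sampling with a Beta(1, 1) prior on each win probability."""
    n_arms = len(win_probs)
    wins = [0] * n_arms
    losses = [0] * n_arms
    for _ in range(rounds):
        # Draw one plausible win probability per machine, then play the best draw.
        samples = [
            rng.betavariate(wins[a] + 1, losses[a] + 1) for a in range(n_arms)
        ]
        arm = max(range(n_arms), key=lambda a: samples[a])
        if pull(win_probs, arm, rng):
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [w + l for w, l in zip(wins, losses)]


if __name__ == "__main__":
    # Made-up win probabilities; the player never sees these directly.
    hidden_probs = [0.10, 0.25, 0.60, 0.30, 0.15]
    print("UCB1 plays per machine:    ",
          run_ucb1(hidden_probs, 5000, random.Random(0)))
    print("Thompson plays per machine:",
          run_thompson(hidden_probs, 5000, random.Random(0)))
```

Both functions return how many times each machine was played. If the methods work as intended, most plays should end up on the machine with the highest hidden win probability, while the other machines are only played enough to rule them out.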

## What is Reinforcement Learning

• The multi-armed bandit problem is one application of reinforcement learning.
• It is not the only type of problem that reinforcement learning can solve. It is just an example.
• Reinforcement learning can solve many other kinds of problems.
• For example, reinforcement learning is used to train robots how to walk.
• In order for a robot to walk, you can program it to walk with a fixed sequence of actions, or you can use reinforcement learning to train the robot to walk in a very interesting way.
• You tell the robot all the actions it can make.
• You tell the robot the goal is to walk forward.
• Whenever the robot moves forward, it will be given a reward (+1), and every time it moves backward, it will be given a punishment (-1 or 0).
• So the robot will try random sequences of actions and see what they lead to.
• The robot will remember the sequences of actions that lead to a good result, and it will repeat them more often.
• Eventually, it will know how to walk without the programmer writing explicit code for how to walk.
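
The reward loop described above can be sketched in a few lines of Python. This is a deliberately simplified toy, not how a real walking robot is trained: the robot here chooses a single action per step instead of a sequence, the action names in `ACTIONS` and the functions `environment_reward` and `train` are hypothetical, and a backward move is assumed to be punished with -1. The agent tries each action, remembers the average reward it received, and repeats the best one most of the time while still exploring occasionally.

```python
import random

# Hypothetical action set that the programmer gives to the robot.
ACTIONS = ["stand_still", "step_backward", "step_forward", "wave_arm"]


def environment_reward(action):
    """Toy environment: +1 for moving forward, -1 for backward, 0 otherwise."""
    if action == "step_forward":
        return 1
    if action == "step_backward":
        return -1
    return 0


def train(steps, epsilon, rng):
    """Trial and error: explore with probability epsilon, else repeat the best."""
    totals = {a: 0.0 for a in ACTIONS}  # total reward received per action
    counts = {a: 0 for a in ACTIONS}    # how often each action was tried

    def average_reward(action):
        # Untried actions look infinitely good, so each gets tried at least once.
        if counts[action] == 0:
            return float("inf")
        return totals[action] / counts[action]

    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)             # try something random
        else:
            action = max(ACTIONS, key=average_reward)  # repeat what worked
        totals[action] += environment_reward(action)
        counts[action] += 1
    return counts


if __name__ == "__main__":
    print(train(steps=1000, epsilon=0.1, rng=random.Random(0)))
```

Nothing in `train` mentions walking forward explicitly. The preference for `step_forward` comes only from the rewards, which is the point of the robot example.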
##### Other Topics on Deep Learning:
• Natural Language Processing (NLP)
• Artificial Neural Networks (ANN)
• Convolutional Neural Networks (CNN)
• Recurrent Neural Networks (RNN)
• Self-Organizing Maps (SOM)
• Boltzmann Machines
• Autoencoders
• XGBoost