Multi-armed bandit (MAB) problems are sequential resource allocation tasks in which one or more resources must be allocated among competing projects so as to maximize the overall expected gain. The central dilemma is whether to choose the option that currently yields the highest immediate gain (exploitation) or to sacrifice some immediate gain in order to discover options that may pay off better in the future (exploration). Strategies for these problems form a subset of reinforcement learning methods, and their objective is to balance exploration and exploitation so that the total reward is maximized.
In this project, the goal is to implement two basic MAB techniques, Epsilon-Greedy and Thompson Sampling, and to compare how they perform across multiple experiments. The problem is to choose, among many websites, the one that yields the best overall reward. Here, the reward is the total number of clicks a website receives across all trials.
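To make the two techniques concrete, here is a minimal sketch of both, not the project's actual implementation. It assumes each website is a Bernoulli arm with a fixed click probability. The function names, the `ctrs` list of click-through rates, the default `epsilon=0.1`, and the uniform Beta(1, 1) prior are all illustrative assumptions.

```python
import random


def epsilon_greedy(ctrs, n_trials, epsilon=0.1, seed=0):
    """Simulate Epsilon-Greedy on Bernoulli arms (hypothetical helper).

    ctrs: assumed true click probabilities, one per website.
    Returns (total_clicks, pulls_per_arm).
    """
    rng = random.Random(seed)
    n_arms = len(ctrs)
    pulls = [0] * n_arms
    clicks = [0] * n_arms
    total = 0

    def estimate(arm):
        # Untried arms get +inf so that each arm is tried at least once.
        return clicks[arm] / pulls[arm] if pulls[arm] else float("inf")

    for _ in range(n_trials):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)          # explore: random website
        else:
            arm = max(range(n_arms), key=estimate)  # exploit: best estimate
        reward = 1 if rng.random() < ctrs[arm] else 0
        pulls[arm] += 1
        clicks[arm] += reward
        total += reward
    return total, pulls


def thompson_sampling(ctrs, n_trials, seed=0):
    """Simulate Thompson Sampling with a Beta(1, 1) prior per arm.

    Returns (total_clicks, pulls_per_arm).
    """
    rng = random.Random(seed)
    n_arms = len(ctrs)
    pulls = [0] * n_arms
    clicks = [0] * n_arms
    total = 0

    for _ in range(n_trials):
        # Draw one sample from each arm's posterior Beta(1 + clicks, 1 + misses).
        samples = [
            rng.betavariate(1 + clicks[a], 1 + pulls[a] - clicks[a])
            for a in range(n_arms)
        ]
        arm = max(range(n_arms), key=samples.__getitem__)
        reward = 1 if rng.random() < ctrs[arm] else 0
        pulls[arm] += 1
        clicks[arm] += reward
        total += reward
    return total, pulls


if __name__ == "__main__":
    example_ctrs = [0.02, 0.05, 0.10]  # made-up rates, for illustration only
    print("epsilon-greedy:", epsilon_greedy(example_ctrs, 10_000))
    print("thompson:      ", thompson_sampling(example_ctrs, 10_000))
```

Epsilon-Greedy explores at a constant rate for the whole run, whereas Thompson Sampling explores less as the posteriors sharpen. Comparing total clicks and per-arm pull counts over repeated runs with different seeds is one way to set up the comparison described above.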
Muhammad Abdullah Khan