Bandit Learning Algorithms

Overview

This repository contains the code and report for an assignment on bandit learning algorithms. The goal of this assignment is to experiment with various bandit learning algorithms, implement them from scratch, and compare their performance. The comparison is based on two metrics: (1) the average accumulated reward and (2) the proportion of time the optimal action is taken.

The assignment consists of two parts:

Part 1: Stationary Reward Distributions
Part 2: Non-Stationary Reward Distributions

Repository Structure

.ipynb_checkpoints/: Contains Jupyter notebook checkpoints.
rl-a1.pdf: Project description.
Part_1_RL_Project1.ipynb: Jupyter notebook for Part 1: Stationary Reward Distributions.
Part2-Drift_change.ipynb: Jupyter notebook for the drift change experiment in Part 2.
Part2_AbruptChange.ipynb: Jupyter notebook for the abrupt change experiment in Part 2.
Part2_mean_reverting_change.ipynb: Jupyter notebook for the mean-reverting change experiment in Part 2.

Requirements

To replicate the results in this repository, you will need the following:

Python 3.x
NumPy
Matplotlib
Seaborn
Jupyter Notebook

Instructions

Part 1: Stationary Reward Distributions

For each method, we repeated the experiment on 1000 different bandit problems (1000 sets of ten mean parameters), each run for 10,000 steps.

  1. Greedy with Non-Optimistic Initial Values: Initialized the action-value estimates to 0 and used the incremental implementation of the simple average method (see the first sketch after this list).

  2. Epsilon-Greedy with Different Choices of Epsilon: Experimented with different values of epsilon, using pilot runs to choose the best one (the sketch after this list takes epsilon as a parameter).

  3. Optimistic Starting Values with a Greedy Approach: Assumed knowledge of the means of each reward distribution to set optimistic initial values.

  4. Gradient Bandit Algorithm: Experimented with different learning rates (α) to determine the best one and tracked the average reward acquired by the algorithm at each time step (see the second sketch after this list).
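
A minimal sketch of how items 1 and 2 might be implemented (function and variable names are our illustration, not necessarily the notebooks' code), assuming, as in the standard 10-armed testbed, rewards drawn from N(mean, 1):

```python
import numpy as np

def run_epsilon_greedy(means, steps=10_000, epsilon=0.1, q_init=0.0, rng=None):
    """One run of (epsilon-)greedy with incremental sample averages.

    means  : true mean reward of each arm (rewards ~ N(mean, 1)).
    epsilon: exploration probability; 0 gives the plain greedy method.
    q_init : initial action-value estimate (0 here; optimistic in item 3).
    Returns the reward received and an optimal-action flag per step.
    """
    rng = np.random.default_rng() if rng is None else rng
    k = len(means)
    q = np.full(k, q_init, dtype=float)  # action-value estimates
    n = np.zeros(k)                      # pull counts per arm
    optimal = np.argmax(means)
    rewards = np.empty(steps)
    took_optimal = np.empty(steps, dtype=bool)
    for t in range(steps):
        if rng.random() < epsilon:
            a = rng.integers(k)          # explore uniformly at random
        else:
            a = np.argmax(q)             # exploit current estimates
        r = rng.normal(means[a], 1.0)    # sample a reward
        n[a] += 1
        q[a] += (r - q[a]) / n[a]        # incremental simple-average update
        rewards[t], took_optimal[t] = r, (a == optimal)
    return rewards, took_optimal
```

Averaging `rewards` and `took_optimal` over 1000 independently drawn sets of means gives the two comparison metrics from the Overview; epsilon = 0 with q_init = 0 recovers method 1, and a large q_init gives the optimistic variant of method 3.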

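For the gradient bandit algorithm (item 4), a sketch of the softmax-preference update with the running average reward as baseline; again the names are illustrative assumptions, not the notebooks' code:

```python
import numpy as np

def run_gradient_bandit(means, steps=10_000, alpha=0.1, rng=None):
    """One run of the gradient bandit algorithm with a baseline.

    Preferences h are turned into action probabilities by a softmax;
    after each reward, the chosen action's preference moves by
    alpha * (r - baseline), and the others move the opposite way.
    """
    rng = np.random.default_rng() if rng is None else rng
    k = len(means)
    h = np.zeros(k)                      # action preferences
    baseline = 0.0                       # running average reward
    rewards = np.empty(steps)
    for t in range(steps):
        pi = np.exp(h - h.max())
        pi /= pi.sum()                   # softmax action probabilities
        a = rng.choice(k, p=pi)
        r = rng.normal(means[a], 1.0)
        baseline += (r - baseline) / (t + 1)
        # preference update: raise the chosen action, lower the others
        h += alpha * (r - baseline) * (np.eye(k)[a] - pi)
        rewards[t] = r
    return rewards
```
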
Part 2: Non-Stationary Reward Distributions

  1. Gradual Changes: Applied a drift change, µ_t = µ_{t−1} + ϵ_t, where ϵ_t ~ N(0, 0.001²).
  2. Mean-Reverting Changes: Applied µ_t = κµ_{t−1} + ϵ_t, where κ = 0.5 and ϵ_t ~ N(0, 0.01²).
  3. Abrupt Changes: At each time step, with probability 0.005, permuted the means of the reward distributions. (A sketch of all three update rules appears after this list.)
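
A minimal sketch of the three mean-update rules, each applied once per time step to the vector of arm means (illustrative only; function names are our own):

```python
import numpy as np

rng = np.random.default_rng()

def drift_change(mu):
    # gradual drift: mu_t = mu_{t-1} + eps_t, eps_t ~ N(0, 0.001^2)
    return mu + rng.normal(0.0, 0.001, size=mu.shape)

def mean_reverting_change(mu, kappa=0.5):
    # mean-reverting: mu_t = kappa * mu_{t-1} + eps_t, eps_t ~ N(0, 0.01^2)
    return kappa * mu + rng.normal(0.0, 0.01, size=mu.shape)

def abrupt_change(mu, p=0.005):
    # with probability p, permute the arms' means; otherwise leave them alone
    return rng.permutation(mu) if rng.random() < p else mu
```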

Evaluation

We compared the optimistic greedy method, ϵ-greedy with a fixed step size, and ϵ-greedy with a decreasing step size. Each algorithm was run on 1000 repetitions of the non-stationary problem, and the distribution of the average reward attained at the terminal step is reported with box plots.
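
A minimal sketch of the reporting step, assuming the per-repetition terminal average rewards have already been collected into a dict of arrays (`terminal_rewards` is a hypothetical name; the fixed-step ϵ-greedy variant simply replaces the 1/N(a) step size in the incremental update with a constant α):

```python
import matplotlib.pyplot as plt
import seaborn as sns

def plot_terminal_rewards(terminal_rewards):
    # terminal_rewards: dict mapping method name -> array of 1000 values,
    # each the average reward attained at the terminal step of one repetition
    sns.boxplot(data=list(terminal_rewards.values()))
    plt.xticks(range(len(terminal_rewards)), list(terminal_rewards.keys()))
    plt.ylabel("Average reward at terminal step")
    plt.title("1000 repetitions of the non-stationary problem")
    plt.show()
```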

Authors

This project was completed by Himani Thakkar and Parinaz Shiri. If you have any questions or need further assistance, please feel free to contact us.
