This repository contains the code developed for the Multi-Agent Systems final exam of the Master's degree in Artificial Intelligence at the Vrije Universiteit Amsterdam.
It aims to solve a reinforcement learning problem where, given the 9 × 9 Gridworld shown in the figure below, an agent should move around so as to maximize the long-term reward. The absorbing states in this environment are the cell at position (6,5), with a negative reward of -50, and the cell at position (8,8), with a positive reward of +50. Entering any other cell gives an immediate reward of -1. The agent is not allowed to move into a wall (the white cells in the figure) or beyond the grid borders, so every action that leads there leaves the state unchanged.
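These dynamics can be sketched with a small step function (a hypothetical helper, not the repository's actual code; the wall coordinates below are placeholders, since the real layout is only given in the figure):

```python
# Deterministic step dynamics: moves that would enter a wall cell or leave
# the 9x9 grid leave the state unchanged.
WALLS = {(1, 2), (2, 2)}  # example wall cells; the real layout is in the figure
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}  # UP, RIGHT, DOWN, LEFT

def step(state, action, size=9):
    i, j = state
    di, dj = MOVES[action]
    ni, nj = i + di, j + dj
    # blocked by the grid borders or by a wall cell: stay in place
    if not (0 <= ni < size and 0 <= nj < size) or (ni, nj) in WALLS:
        return state
    return (ni, nj)
```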
In order to learn the optimal policy, the agent has to act in the world to gain experience. In the next sections I will discuss three solutions for this model-free problem: Monte Carlo policy evaluation, SARSA, and Q-learning.
The encoding of the Gridworld is crucial to make the learning process faster and clearer. Therefore I defined the class GridWorld, which contains all the methods needed to manipulate the environment as well as the Monte Carlo evaluation, SARSA, and Q-learning algorithms. The most important attributes of this class are:
- actions: dictionary mapping the index of each action to its name:
  actions = { 0: 'UP', 1: 'RIGHT', 2: 'DOWN', 3: 'LEFT' }
- rewards: 9 × 9 matrix containing the immediate reward for each state
- transactions: 9 × 9 × 4 matrix where each cell contains the coordinates of the next state reached through the deterministic transition for the action specified by the third dimension
- policy: 9 × 9 matrix where each cell contains the action given by the policy for the corresponding state
With this representation it is straightforward to obtain the next state and the immediate reward during the generation of the episodes. At the start of the game the Gridworld is initialized: the size, the obstacle positions, and the terminal state positions must be given. At this point the software creates an instance of GridWorld that is used for each learning algorithm.
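A minimal sketch of how such a class might look follows; the constructor signature and the helper `_next` are assumptions for illustration, not the repository's exact API (only the attribute names `actions`, `rewards`, and `transactions` come from the text above):

```python
class GridWorld:
    """Illustrative sketch of the GridWorld container described above."""

    def __init__(self, size, obstacles, terminals):
        self.size = size
        self.actions = {0: 'UP', 1: 'RIGHT', 2: 'DOWN', 3: 'LEFT'}
        self.obstacles = set(obstacles)
        # rewards: -1 everywhere, overridden at the terminal states
        self.rewards = [[-1] * size for _ in range(size)]
        for (i, j), r in terminals.items():
            self.rewards[i][j] = r
        # transactions[i][j][a] = next state for action a (deterministic)
        moves = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}
        self.transactions = [[[self._next((i, j), moves[a]) for a in range(4)]
                              for j in range(size)] for i in range(size)]

    def _next(self, state, move):
        i, j = state
        ni, nj = i + move[0], j + move[1]
        # walls and grid borders leave the state unchanged
        if not (0 <= ni < self.size and 0 <= nj < self.size) or (ni, nj) in self.obstacles:
            return state
        return (ni, nj)

# example initialization with a placeholder obstacle at (4, 4)
gw = GridWorld(9, obstacles=[(4, 4)], terminals={(6, 5): -50, (8, 8): 50})
```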
I also defined the class Experience for a compact representation of the learning experience. It gathers in a single object the four elements that define an experience: state, action, reward, next state.
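Such a container could be sketched as a dataclass (the field names simply follow the four elements listed above; the actual class in the repository may differ):

```python
from dataclasses import dataclass

@dataclass
class Experience:
    """One step of learning experience: (state, action, reward, next state)."""
    state: tuple
    action: int
    reward: float
    next_state: tuple

# example: moving RIGHT from (0, 0) to (0, 1) with the per-step penalty of -1
e = Experience(state=(0, 0), action=1, reward=-1.0, next_state=(0, 1))
```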
The Monte Carlo algorithm used to evaluate the policy is standard: no specific tricks were used, and it was implemented following the pseudocode in the Sutton & Barto book [1]. The parameters were set as follows:
- number of episodes: 10000. This number was chosen after several trials, looking for a good result with a relatively low number of episodes
- policy: the policy $\pi$ to evaluate is equiprobable, so each action has the same probability $\frac{1}{4}$
- discount factor: the discount factor is set to 1. In this way the agent does not downweight distant rewards: rather than minding only the non-terminal penalties, it aims for the final big prize of +50.
Theoretically, we know that the value function converges asymptotically if every state-action pair (s, a) is visited infinitely often. For this reason the implementation uses the exploring-starts approach: before generating an episode, the starting state is chosen randomly among all states that are neither terminal nor obstacles. Moreover, both first-visit MC and every-visit MC have been implemented, and one can select the version by changing the value of the boolean parameter every_state of the method monte_carlo_equiprobable_policy_evaluation().
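The first-visit/every-visit distinction can be sketched with a generic evaluation helper (an illustrative assumption, not the repository's code; here an episode is a list of (state, reward) pairs and the flag is named every_visit):

```python
from collections import defaultdict

def mc_evaluate(episodes, gamma=1.0, every_visit=False):
    """First-visit (default) or every-visit Monte Carlo policy evaluation."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # walk the episode backwards, accumulating the return G
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            g = gamma * g + reward
            # first visit = this state does not appear earlier in the episode
            first = all(episode[k][0] != state for k in range(t))
            if every_visit or first:
                returns[state].append(g)
    # value estimate = average of the collected returns per state
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```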
The resulting state value function is shown in the figure below.
From what is known in theory, both of these algorithms converge to the true value function $v_{\pi}$ as the number of visits to each state goes to infinity.
In order to find the optimal policy, a 9 × 9 × 4 matrix representing the state-action value function $q(s, a)$ is used. The learning alternates two steps:
- greedification: deriving the $\epsilon$-greedy policy that maximizes the action value for each state $s=(i,j)$: with probability $1-\epsilon$ the policy is $\pi(s)=\operatorname{argmax}_a \, q(s,a)$, and with probability $\epsilon$ the action $\pi(s)$ is chosen randomly
- evaluation: the SARSA algorithm evaluates the policy $\pi$ and improves the estimate of the state-action value function $q_{\pi}$
These two steps, repeated enough times in succession, allow convergence toward $q^*$ and $\pi^*$.
The update rule used by the SARSA algorithm is the following: $$ q(s, a) \leftarrow q(s, a)+\alpha\left[r+\gamma q\left(s^{\prime}, a^{\prime}\right)-q(s, a)\right] $$ The implementation of the method sarsa_algorithm follows the pseudocode written below. As done for Monte Carlo policy evaluation, before generating the first experience (s, a, r, s', a') of each episode, the starting state s is chosen randomly among all states that are neither terminal nor obstacles.
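The update rule above translates directly into code; this is a minimal sketch of a single SARSA update on a dict-of-dicts q table (an illustrative helper, not the repository's sarsa_algorithm itself):

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0):
    """One SARSA update: bootstrap on the action a' actually taken in s'."""
    td_error = r + gamma * q[s_next][a_next] - q[s][a]
    q[s][a] += alpha * td_error
    return td_error
```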
The parameters used for the method sarsa_algorithm are listed below:
- discount_factor=1.0 (for the same reason explained for Monte Carlo policy evaluation)
- num_episodes=10000 (number of episodes used for training)
- epsilon=0.2 (value for the $\epsilon$-greedy policy)
- alpha=0.5 (value used for the learning rate)
From the theory it is well known that the SARSA algorithm converges to the optimal state-action value function given a GLIE exploration policy and a learning rate sequence satisfying the stochastic-approximation conditions $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.
After estimating the $q^*$ values with the SARSA algorithm, it is possible to obtain the greedy policy by taking, for each state, the action with the maximal state-action value. Then it is straightforward to compute the $v^*$ matrix using the relation $$ v^*(s) = \max_a \, q^*(s,a) $$ The value function matrix obtained with this policy is shown in the figure below. At first glance the results seem acceptable, but most likely this is not the true optimal policy since, for instance, in state (0,1) it makes more sense to go right instead of down.
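This greedification step can be sketched as follows, over a nested-list q matrix of shape rows × cols × 4 (a pure-Python illustration, not the repository's code):

```python
def greedy_policy_and_values(q):
    """Derive the greedy policy pi(s) = argmax_a q(s, a) and the value
    function v(s) = max_a q(s, a) from a rows x cols x 4 q table."""
    rows, cols = len(q), len(q[0])
    policy = [[max(range(4), key=lambda a: q[i][j][a]) for j in range(cols)]
              for i in range(rows)]
    v = [[max(q[i][j]) for j in range(cols)] for i in range(rows)]
    return policy, v
```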
The implementation of the Q-learning algorithm used to search for the optimal policy is standard, and it follows the pseudocode written in the figure below.
The method q_learning_algorithm has the same parameters as the method sarsa_algorithm specified before, and the only differences in the algorithm are:
- Q-learning directly estimates the optimal value function $q^*$, without a given policy $\pi$ to improve
- the update rule for Q-learning is:
$$ q\left(s, a\right) \leftarrow q\left(s, a\right)+\alpha\left[r+\gamma \max _{a'} q\left(s', a'\right)-q\left(s, a\right)\right] $$
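As with SARSA, the rule maps directly to a one-line update; the only change is bootstrapping on the maximum over next actions (an illustrative helper on a dict-of-lists table, not the repository's q_learning_algorithm itself):

```python
def q_learning_update(q, s, a, r, s_next, alpha=0.5, gamma=1.0):
    """One Q-learning update: bootstrap on max_a' q(s', a') instead of the
    action actually taken, making the algorithm off-policy."""
    td_error = r + gamma * max(q[s_next]) - q[s][a]
    q[s][a] += alpha * td_error
    return td_error
```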
As done for the SARSA results, once the optimal state-action value function has been computed, the optimal policy and the optimal state value function matrix are derived from it.
The obtained $v^*$ matrix with the optimal policy is shown in the figure below. This result is definitely more reliable than the previous one, and it is easy to check on the Gridworld that each value is exactly the total reward along the shortest path to the "treasure" absorbing state.
In order to compare the two model-free algorithms, some statistics were collected during learning.
During the execution of both SARSA and Q-learning, it is observable that the first episodes are slower to compute, but after a few hundred iterations the algorithms become faster and faster, producing the solution in less than 15 seconds. This behavior can be understood by looking at the figure below, which shows the length of each episode (until an absorbing state is reached) during learning. In the first stage the episodes are very long, since the policy is derived from a randomly initialized state-action value function; as learning progresses, the policy improves and the episodes become much shorter.
Another observation can be made about the speed of convergence of the two algorithms. For each episode, the temporal-difference (TD) error of each step was computed, and the figure below shows its average and standard deviation along the training. For both algorithms it oscillates more in the first episodes and then gets closer to 0 in the last episodes. Observing the standard deviation, it seems that Q-learning converges faster than SARSA, which helps to explain why the results obtained by Q-learning are more accurate.
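The per-episode statistics behind such a plot can be sketched as follows (assumed bookkeeping over the TD errors returned by each update, not the repository's exact code):

```python
import statistics

def td_stats(td_errors_per_episode):
    """Per-episode mean and standard deviation of the TD errors collected
    during learning; input is a list of lists, one inner list per episode."""
    means = [statistics.fmean(errs) for errs in td_errors_per_episode]
    stdevs = [statistics.stdev(errs) if len(errs) > 1 else 0.0
              for errs in td_errors_per_episode]
    return means, stdevs
```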
[1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Second edition. The MIT Press, 2018. URL: http://incompleteideas.net/book/the-book-2nd.html