This is a solution for the second project of the Udacity deep reinforcement learning course. It includes scripts for training agents with any of the A2C, PPO, and TD3 algorithms and for testing them in a simulation environment. The models were trained using the Stable Baselines3 project.
The GIF shows the behavior of multiple agents using a model trained with PPO in this codebase. The agent parameters can be found under `experiments/ppo_multi_agent_lr_0_00003/model.zip`.
The agent consists of an arm with two joints, and the environment contains a sphere which rotates around the agent. The goal is to keep touching the sphere for as long as possible during an episode of 1000 timesteps.
Rewards:
- +0.04 for each timestep the agent touches the sphere

Input state:
- 33 continuous variables corresponding to the position, rotation, velocity, and angular velocities of the arm

Actions:
- 4 continuous variables corresponding to the torque applied to the two joints, with values in [-1.0, 1.0]

Goal:
- Achieve an average score of at least +30 over 100 consecutive episodes
Environments: Two environments are available, one with a single agent and one with 20 agents. Evaluation in the 20-agent environment differs in that the score of each episode is the average of all agents' rewards. For training, the only practical difference is that the 20 agents act as 20 parallel copies of the environment, which speeds up exploration.
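The scoring and solving criteria above can be sketched in plain Python. The function and variable names below are illustrative, not taken from the codebase:

```python
def episode_score(scores_per_agent):
    """Score of one episode in the 20-agent environment:
    the average of the individual agents' total rewards."""
    return sum(scores_per_agent) / len(scores_per_agent)

def is_solved(episode_scores, window=100, target=30.0):
    """The environment counts as solved once the average score
    over the last `window` consecutive episodes reaches `target`."""
    if len(episode_scores) < window:
        return False
    return sum(episode_scores[-window:]) / window >= target
```

For example, `is_solved([31.0] * 100)` holds, while 100 episodes averaging 29 would not.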
The problem is solved with all three algorithms (A2C, PPO, and TD3) using the Stable Baselines3 framework. For more details and a comparison of the algorithms' behavior, see the accompanying report.
To set up the project, follow these steps:
- Provide an environment with `python 3.6.x` installed; ideally create a new one with e.g. pyenv or conda
- Clone and install the project:
```
git clone [email protected]:koulakis/reacher-deep-reinforcement-learning.git
cd reacher-deep-reinforcement-learning
pip install .
```
- Create a directory called `udacity_reacher_environment_single_agent` or `udacity_reacher_environment_multi_agent` (to use with the single- or 20-agent environment respectively) under the root of the project, then download and extract there the environment compatible with your architecture. You can find the download links here.
- Install a version of PyTorch compatible with your architecture. The version used to develop the project was 1.5.0, e.g.
```
pip install torch==1.5.0
```
To check that everything is set up properly, run the following test, which loads an environment and runs a random agent:
```
python scripts/run_agents_in_environment.py --random-agent
```
or
```
python scripts/run_agents_in_environment.py --random-agent --agent-type multi
```
which runs the 20-agent environment.
The project comes with some pre-trained agents, along with scripts to test them and to train your own.
- `train_agent.py`: Used to train an agent. The parameter `experiment-name` names your agent, and by default the script creates a directory with that name under `experiments`. The trained agent parameters are saved there at the end of training, and during training several metrics are logged to a tfevents file in the same directory. Here is an example call:
```
python scripts/train_agent.py --experiment-name td3_rl_0_001 --agent-type single --learning-rate 0.001 --rl-algorithm td3 --total-timesteps 500000 --environment-port 5005
```
To monitor the metrics, one can launch a TensorBoard server with:
```
tensorboard --logdir experiments
```
This reads the metrics of all experiments and makes them available under `localhost:6006`. One can run multiple trainings in parallel by using a different port per environment with the `environment-port` flag.
- `test_agent_in_environment`: This script can be used to test an agent in a given environment. As mentioned above, the saved agent models can be found inside the sub-folders of `experiments`. An example usage:
```
python scripts/run_agents_in_environment.py --agent-type multi --agent-parameters-path experiments/ppo_multi_agent_lr_0_00003/model.zip --environment-port 5007
```
Under the `experiments` directory there are several pre-trained agents one can run in the environment. Some examples of models which have solved the environment are:
- Best A2C model: `a2c_lr_0_0001/tensorboard_logs/A2C_1`
- Best PPO model: `ppo_multi_agent_lr_0_00003/tensorboard_logs/PPO_1`
- Best PPO model trained with a single agent: `ppo_large_128_128_128_lr_0_00005_3M_steps/tensorboard_logs/PPO_1`
- Best TD3 model trained with a single agent: `td3_0_001/tensorboard_logs/TD3_1`
Given that this project is an assignment of an online course, it has been heavily influenced by code provided by Udacity and by several mainstream publications. Below you can find some links which give broader context.
- All 3 algorithms used were trained using the Stable Baselines3 project
- Most of the simulation setup comes from this notebook
- The unity environment created by Udacity is a direct copy from here
The following publications were used:
- Asynchronous Methods for Deep Reinforcement Learning. Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T.P., Harley, T., Silver, D., & Kavukcuoglu, K. arXiv:1602.01783. 2016.
- Proximal Policy Optimization Algorithms. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov. arXiv:1707.06347. 2017.
- High-Dimensional Continuous Control Using Generalized Advantage Estimation. John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, Pieter Abbeel. arXiv:1506.02438. 2015.
- Continuous control with deep reinforcement learning. Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra. arXiv:1509.02971. 2015.
- Addressing Function Approximation Error in Actor-Critic Methods. Scott Fujimoto, Herke van Hoof, David Meger. arXiv:1802.09477. 2018.