yamatokataoka / reinforcement-learning-replications

Reinforcement Learning Replications is a set of PyTorch implementations of reinforcement learning algorithms.

License: MIT License

Python 96.78% HTML 3.22%
deep-learning pytorch reinforcement-learning


reinforcement-learning-replications's Issues

Rethink training design

I'm thinking of using rl-replicas in this project:
https://github.com/yamatokataoka/learning-from-human-preferences

In that case, the learn function should be customizable so that the reward function provided by an OpenAI Gym environment can be swapped with a learned reward function.

I also don't think inheritance is easy to understand or extensible enough for future RL implementations.

Todos

  • Refactor collect_one_epoch_experience functions
  • Use the same data class for OneEpochExperience #71
  • Extract collect_one_epoch_experience functions as a sampler

The point is how much closer I can bring these two functions in on_policy_algorithm.py and off_policy_algorithm.py, as sketched below.
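A hedged sketch of one possible direction, using hypothetical names rather than existing rl-replicas classes: a standalone Sampler that both algorithm base classes could share, with an optional reward function that overrides the environment's reward (for example, one learned from human preferences).

from typing import Callable, Dict, List, Optional

import gym
import numpy as np


class Sampler:
    """Collect one epoch of experience; shared by on- and off-policy algorithms."""

    def __init__(
        self,
        env: gym.Env,
        reward_function: Optional[Callable[[np.ndarray, np.ndarray], float]] = None,
    ) -> None:
        self.env = env
        # When set, this replaces the reward returned by the environment.
        self.reward_function = reward_function

    def sample(self, policy: Callable, num_steps: int) -> Dict[str, List]:
        observations: List = []
        actions: List = []
        rewards: List = []
        observation = self.env.reset()
        for _ in range(num_steps):
            action = policy(observation)  # simplified; a real policy returns more
            # Old Gym step API: (observation, reward, done, info).
            next_observation, reward, done, _ = self.env.step(action)
            if self.reward_function is not None:
                reward = self.reward_function(observation, action)
            observations.append(observation)
            actions.append(action)
            rewards.append(reward)
            observation = self.env.reset() if done else next_observation
        return {"observations": observations, "actions": actions, "rewards": rewards}

With such a split, swapping the Gym reward for a learned one becomes a constructor argument instead of a change inside learn, and composition replaces part of the current inheritance.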

Add linters

Lint the code before the codebase grows further.

  • Lint code with isort, black, flake8, mypy
  • Automate checking and linting

Value loss is increasing

The value loss is increasing during training, whereas it decreases in the OpenAI Spinning Up implementation.

For example, the current implementation's value loss increases from 178 to 451, while Spinning Up's decreases from 253 to 171.

For now, I confirmed that the following parameters are the same between the two implementations:

  • number of network parameters: policy: 4610, value function: 4545
  • network architecture (two hidden layers with 64 units)
  • total environment interactions
  • number of value function updates
  • learning rate for both the policy and the value function
  • gym environment: CartPole-v0
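For reference, a minimal, self-contained sketch of the value-function update whose loss is being compared. The two-hidden-layer 64-unit MLP matches the architecture listed above (4545 parameters for CartPole-v0's 4-dimensional observations), while the learning rate, the 80 update steps, and the random data are placeholders rather than the actual settings.

import torch
import torch.nn as nn

# Value function for CartPole-v0 (observation dimension 4): 4545 parameters.
value_function = nn.Sequential(
    nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1)
)
value_optimizer = torch.optim.Adam(value_function.parameters(), lr=1e-3)

observations = torch.randn(4000, 4)  # placeholder observations
returns = torch.randn(4000)          # placeholder discounted returns

for _ in range(80):  # placeholder number of value function updates per epoch
    value_loss = nn.functional.mse_loss(
        value_function(observations).squeeze(-1), returns
    )
    value_optimizer.zero_grad()
    value_loss.backward()
    value_optimizer.step()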

Implement TD3

Implement Twin Delayed Deep Deterministic Policy Gradient (TD3).
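A hedged sketch of the core TD3 target computation only, not the project's eventual implementation: target policy smoothing plus the clipped double-Q (twin critic) minimum. The networks are placeholders passed in by the caller, and all tensors are assumed to have shape (batch_size,).

import torch

def td3_target(
    rewards: torch.Tensor,
    next_observations: torch.Tensor,
    dones: torch.Tensor,
    target_policy,
    target_q1,
    target_q2,
    gamma: float = 0.99,
    noise_std: float = 0.2,
    noise_clip: float = 0.5,
    action_limit: float = 1.0,
) -> torch.Tensor:
    with torch.no_grad():
        # Target policy smoothing: add clipped noise to the target action.
        next_actions = target_policy(next_observations)
        noise = (torch.randn_like(next_actions) * noise_std).clamp(-noise_clip, noise_clip)
        next_actions = (next_actions + noise).clamp(-action_limit, action_limit)
        # Clipped double-Q learning: take the minimum of the twin target critics.
        next_q = torch.min(
            target_q1(next_observations, next_actions),
            target_q2(next_observations, next_actions),
        )
        return rewards + gamma * (1.0 - dones) * next_q

The third TD3 ingredient, delayed policy updates, amounts to updating the policy and target networks only once every few critic updates.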

Add integration test

Add integration test for all RL algorithms

  • VPG
  • TRPO
  • PPO
  • DDPG
  • TD3

Test with environments that have either discrete or continuous action spaces.
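A hedged pytest sketch of what such an integration test could look like; the constructor and learn() call mirror the usage snippet later on this page, while the test layout and the environment list are assumptions, not the project's actual test suite.

import gym
import pytest

from rl_replicas.common.policies import ActorCriticPolicy
from rl_replicas.vpg.vpg import VPG

@pytest.mark.parametrize("env_id", ["CartPole-v0"])  # discrete; add a continuous env once supported
def test_vpg_smoke(env_id: str) -> None:
    env = gym.make(env_id)
    model = VPG(ActorCriticPolicy, env, seed=0)
    model.learn()  # a real test would cap epochs/steps so this finishes quickly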

Improve packaging and release flow

  • transition from setup.py to pyproject.toml and setup.cfg to avoid executing arbitrary scripts and reduce boilerplate code.
  • single-source the package version (a sketch follows this list)
  • set up GitHub Actions workflows for publishing to PyPI and TestPyPI
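For the version single-sourcing item, one minimal sketch is to derive the version from the installed package metadata at runtime, so it is declared only once in the packaging configuration; the distribution name used here is an assumption.

from importlib.metadata import PackageNotFoundError, version

try:
    # Declared once in pyproject.toml / setup.cfg; read back here at runtime.
    __version__ = version("rl-replicas")
except PackageNotFoundError:
    # Running from a source checkout without an installed distribution.
    __version__ = "unknown"

Note that importlib.metadata is in the standard library only from Python 3.8, so with the 3.7 requirement discussed below this would need the importlib-metadata backport or a plain __version__ string instead.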

Improve packaging

  • transition from setup.py to pyproject.toml and setup.cfg to avoid executing arbitrary scripts and reduce boilerplate code.
  • single-source the package version

Run LunarLander-v2 experiments as benchmark

Experiment settings (a run sketch follows the list)

  • epochs (number of policy updates): 1000
  • steps_per_epoch: 4000
  • n_value_gradients: 80
  • gamma: 0.99
  • gae_lambda: 0.97
  • value function learning_rate: 1e-3
  • policy learning_rate: 3e-4
  • seed: (0 ~ 2)
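A hedged sketch of looping the benchmark over the listed seeds; the constructor and learn() call follow the usage snippet later on this page, the choice of VPG is only an example, and how the hyperparameters above are passed in is left open here.

import gym

from rl_replicas.common.policies import ActorCriticPolicy
from rl_replicas.vpg.vpg import VPG

for seed in range(3):  # seeds 0 ~ 2
    env = gym.make("LunarLander-v2")
    model = VPG(ActorCriticPolicy, env, seed=seed)
    model.learn()  # the settings above would be supplied here or to the constructor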

Bump down Python version

In #75, I bumped the required Python version up to 3.10, but I found it inconvenient when running in Colab.
As of July 2022, the Python version in Colab is 3.7.
Thus, I will bump the requirement down to 3.7.

Performance Issue on Vanilla Policy Gradient

My initial implementation of VPG took around 300 seconds to execute 200,000 environment interactions.
That is about 2x slower than the OpenAI Spinning Up implementation.

I ran this with the cProfile module:

import gym
from rl_replicas.vpg.vpg import VPG
from rl_replicas.common.policies import ActorCriticPolicy

env = gym.make('CartPole-v0')

model = VPG(ActorCriticPolicy, env, seed=1)

model.learn()
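A hedged guess at how the script above may have been profiled with the cProfile module; the output filename and the stats printout are illustrative only.

import cProfile
import pstats

# Profile the training run and inspect the most expensive calls by cumulative time.
cProfile.run("model.learn()", filename="rl_replicas_vpg.prof")
pstats.Stats("rl_replicas_vpg.prof").sort_stats("cumulative").print_stats(20)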

07102020_performance_spiningup_vpg.txt

07102020_performance_rl_replicas_vpg.txt

Split policies for re-usability and extensibility

The policy classes should be split by type: deterministic or stochastic, and categorical or normal (Gaussian) distribution.
This is preparation for implementing DDPG and for supporting continuous action spaces in the on-policy algorithms.
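A hedged sketch of the proposed split; the class names are hypothetical illustrations, not the project's actual classes. A deterministic policy maps observations straight to actions (what DDPG needs), while stochastic policies return a categorical distribution for discrete action spaces or a Gaussian for continuous ones.

import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal


class DeterministicPolicy(nn.Module):
    """Maps observations directly to actions (e.g. for DDPG)."""

    def __init__(self, network: nn.Module) -> None:
        super().__init__()
        self.network = network

    def forward(self, observation: torch.Tensor) -> torch.Tensor:
        return self.network(observation)


class CategoricalPolicy(nn.Module):
    """Stochastic policy over a discrete action space."""

    def __init__(self, network: nn.Module) -> None:
        super().__init__()
        self.network = network

    def forward(self, observation: torch.Tensor) -> Categorical:
        return Categorical(logits=self.network(observation))


class GaussianPolicy(nn.Module):
    """Stochastic policy over a continuous action space."""

    def __init__(self, network: nn.Module, action_size: int) -> None:
        super().__init__()
        self.network = network
        self.log_std = nn.Parameter(torch.zeros(action_size))

    def forward(self, observation: torch.Tensor) -> Normal:
        return Normal(self.network(observation), self.log_std.exp())

Keeping the distribution logic inside each policy class keeps the algorithm code agnostic to the action space, which is what DDPG and continuous-action support need.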

Implement PPO

Implement Proximal Policy Optimization (PPO) (by clipping)
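A hedged sketch of the clipped surrogate objective at the heart of PPO-Clip, not the project's eventual implementation; all tensors are placeholders collected elsewhere.

import torch

def ppo_clip_loss(
    log_probs: torch.Tensor,
    old_log_probs: torch.Tensor,
    advantages: torch.Tensor,
    clip_range: float = 0.2,
) -> torch.Tensor:
    # Probability ratio between the current and the data-collecting policy.
    ratio = torch.exp(log_probs - old_log_probs)
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range)
    # Maximize the clipped surrogate objective, i.e. minimize its negation.
    return -torch.min(ratio * advantages, clipped_ratio * advantages).mean()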

Set up mypy

  • Set up mypy for type checking.
  • Resolve all errors and warnings from mypy.
  • Remove unnecessary type hints.
