yamatokataoka / reinforcement-learning-replications

Reinforcement Learning Replications is a set of PyTorch implementations of reinforcement learning algorithms.

License: MIT License

Python 96.78% HTML 3.22%
deep-learning pytorch reinforcement-learning


reinforcement-learning-replications's Issues

Rethink training design

I'm thinking of using rl-replicas in this project:
https://github.com/yamatokataoka/learning-from-human-preferences

In that case, the learn function should be customizable so that the reward function provided by an OpenAI Gym environment can be swapped with a learned reward function.

I also don't think inheritance is easy to understand or extensible enough for future RL implementations.

Todos

  • Refactor collect_one_epoch_experience functions
  • Use the same data class for OneEpochExperience #71
  • Extract collect_one_epoch_experience functions as a sampler

The point is how much closer I can bring these two functions in on_policy_algorithm.py and off_policy_algorithm.py, as sketched below.
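A hedged sketch of one possible direction, using hypothetical names rather than existing rl-replicas classes: a standalone Sampler that both algorithm base classes could share, with an optional reward function that overrides the environment's reward (for example, one learned from human preferences).

from typing import Callable, Dict, List, Optional

import gym
import numpy as np


class Sampler:
    """Collect one epoch of experience; shared by on- and off-policy algorithms."""

    def __init__(
        self,
        env: gym.Env,
        reward_function: Optional[Callable[[np.ndarray, np.ndarray], float]] = None,
    ) -> None:
        self.env = env
        # When set, this replaces the reward returned by the environment.
        self.reward_function = reward_function

    def sample(self, policy: Callable, num_steps: int) -> Dict[str, List]:
        observations: List = []
        actions: List = []
        rewards: List = []
        observation = self.env.reset()
        for _ in range(num_steps):
            action = policy(observation)  # simplified; a real policy returns more
            # Old Gym step API: (observation, reward, done, info).
            next_observation, reward, done, _ = self.env.step(action)
            if self.reward_function is not None:
                reward = self.reward_function(observation, action)
            observations.append(observation)
            actions.append(action)
            rewards.append(reward)
            observation = self.env.reset() if done else next_observation
        return {"observations": observations, "actions": actions, "rewards": rewards}

With such a split, swapping the Gym reward for a learned one becomes a constructor argument instead of a change inside learn, and composition replaces part of the current inheritance.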

Add linters

Lint the code before the codebase grows further.

  • Lint code with isort, black, flake8, mypy
  • Automate checking and linting

Value loss is increasing

The value loss is increasing during training, whereas it decreases in the OpenAI Spinning Up implementation.

For example, the current implementation's value loss increases from 178 to 451, while Spinning Up's decreases from 253 to 171.

For now, I confirmed that the following parameters are the same between the two implementations:

  • number of network parameters: policy: 4610, value function: 4545
  • network architecture (two hidden layers with 64 units)
  • total environment interactions
  • number of value function updates
  • learning rate for both the policy and the value function
  • gym environment: CartPole-v0
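For reference, a minimal, self-contained sketch of the value-function update whose loss is being compared. The two-hidden-layer 64-unit MLP matches the architecture listed above (4545 parameters for CartPole-v0's 4-dimensional observations), while the learning rate, the 80 update steps, and the random data are placeholders rather than the actual settings.

import torch
import torch.nn as nn

# Value function for CartPole-v0 (observation dimension 4): 4545 parameters.
value_function = nn.Sequential(
    nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1)
)
value_optimizer = torch.optim.Adam(value_function.parameters(), lr=1e-3)

observations = torch.randn(4000, 4)  # placeholder observations
returns = torch.randn(4000)          # placeholder discounted returns

for _ in range(80):  # placeholder number of value function updates per epoch
    value_loss = nn.functional.mse_loss(
        value_function(observations).squeeze(-1), returns
    )
    value_optimizer.zero_grad()
    value_loss.backward()
    value_optimizer.step()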

Implement TD3

Implement Twin Delayed Deep Deterministic Policy Gradient (TD3).
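A hedged sketch of the core TD3 target computation only, not the project's eventual implementation: target policy smoothing plus the clipped double-Q (twin critic) minimum. The networks are placeholders passed in by the caller, and all tensors are assumed to have shape (batch_size,).

import torch

def td3_target(
    rewards: torch.Tensor,
    next_observations: torch.Tensor,
    dones: torch.Tensor,
    target_policy,
    target_q1,
    target_q2,
    gamma: float = 0.99,
    noise_std: float = 0.2,
    noise_clip: float = 0.5,
    action_limit: float = 1.0,
) -> torch.Tensor:
    with torch.no_grad():
        # Target policy smoothing: add clipped noise to the target action.
        next_actions = target_policy(next_observations)
        noise = (torch.randn_like(next_actions) * noise_std).clamp(-noise_clip, noise_clip)
        next_actions = (next_actions + noise).clamp(-action_limit, action_limit)
        # Clipped double-Q learning: take the minimum of the twin target critics.
        next_q = torch.min(
            target_q1(next_observations, next_actions),
            target_q2(next_observations, next_actions),
        )
        return rewards + gamma * (1.0 - dones) * next_q

The third TD3 ingredient, delayed policy updates, amounts to updating the policy and target networks only once every few critic updates.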

Add integration test

Add integration test for all RL algorithms

  • VPG
  • TRPO
  • PPO
  • DDPG
  • TD3

Test with environments that have either discrete or continuous action spaces.
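A hedged pytest sketch of what such an integration test could look like; the constructor and learn() call mirror the usage snippet later on this page, while the test layout and the environment list are assumptions, not the project's actual test suite.

import gym
import pytest

from rl_replicas.common.policies import ActorCriticPolicy
from rl_replicas.vpg.vpg import VPG

@pytest.mark.parametrize("env_id", ["CartPole-v0"])  # discrete; add a continuous env once supported
def test_vpg_smoke(env_id: str) -> None:
    env = gym.make(env_id)
    model = VPG(ActorCriticPolicy, env, seed=0)
    model.learn()  # a real test would cap epochs/steps so this finishes quickly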

Improve packaging and release flow

  • transition from setup.py to pyproject.toml and setup.cfg to avoid executing arbitrary scripts and reduce boilerplate code.
  • single-source the package version (a sketch follows this list)
  • set up GitHub Actions workflows for publishing to PyPI and TestPyPI
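For the version single-sourcing item, one minimal sketch is to derive the version from the installed package metadata at runtime, so it is declared only once in the packaging configuration; the distribution name used here is an assumption.

from importlib.metadata import PackageNotFoundError, version

try:
    # Declared once in pyproject.toml / setup.cfg; read back here at runtime.
    __version__ = version("rl-replicas")
except PackageNotFoundError:
    # Running from a source checkout without an installed distribution.
    __version__ = "unknown"

Note that importlib.metadata is in the standard library only from Python 3.8, so with the 3.7 requirement discussed below this would need the importlib-metadata backport or a plain __version__ string instead.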

Improve packaging

  • transition from setup.py to pyproject.toml and setup.cfg to avoid executing arbitrary scripts and reduce boilerplate code.
  • single-source the package version

Run LunarLander-v2 experiments as benchmark

Experiment settings (a run sketch follows the list)

  • epochs (number of policy updates): 1000
  • steps_per_epoch: 4000
  • n_value_gradients: 80
  • gamma: 0.99
  • gae_lambda: 0.97
  • value function learning_rate: 1e-3
  • policy learning_rate: 3e-4
  • seed: (0 ~ 2)
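A hedged sketch of looping the benchmark over the listed seeds; the constructor and learn() call follow the usage snippet later on this page, the choice of VPG is only an example, and how the hyperparameters above are passed in is left open here.

import gym

from rl_replicas.common.policies import ActorCriticPolicy
from rl_replicas.vpg.vpg import VPG

for seed in range(3):  # seeds 0 ~ 2
    env = gym.make("LunarLander-v2")
    model = VPG(ActorCriticPolicy, env, seed=seed)
    model.learn()  # the settings above would be supplied here or to the constructor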

Bump down Python version

In #75, I bumped the required Python version up to 3.10, but I found it inconvenient when running in Colab.
As of July 2022, the Python version in Colab is 3.7.
Thus, I will bump the requirement down to 3.7.

Performance Issue on Vanilla Policy Gradient

My initial implementation of VPG took around 300 seconds to execute 200,000 environment interactions.
That is about 2x slower than the OpenAI Spinning Up implementation.

I ran this with the cProfile module:

import gym
from rl_replicas.vpg.vpg import VPG
from rl_replicas.common.policies import ActorCriticPolicy

env = gym.make('CartPole-v0')

model = VPG(ActorCriticPolicy, env, seed=1)

model.learn()
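A hedged guess at how the script above may have been profiled with the cProfile module; the output filename and the stats printout are illustrative only.

import cProfile
import pstats

# Profile the training run and inspect the most expensive calls by cumulative time.
cProfile.run("model.learn()", filename="rl_replicas_vpg.prof")
pstats.Stats("rl_replicas_vpg.prof").sort_stats("cumulative").print_stats(20)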

07102020_performance_spiningup_vpg.txt

07102020_performance_rl_replicas_vpg.txt

Split policies for re-usability and extensibility

The policy classes should be split by type: deterministic or stochastic, and categorical or normal (Gaussian) distribution.
This is preparation for implementing DDPG and for supporting continuous action spaces in the on-policy algorithms.
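A hedged sketch of the proposed split; the class names are hypothetical illustrations, not the project's actual classes. A deterministic policy maps observations straight to actions (what DDPG needs), while stochastic policies return a categorical distribution for discrete action spaces or a Gaussian for continuous ones.

import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal


class DeterministicPolicy(nn.Module):
    """Maps observations directly to actions (e.g. for DDPG)."""

    def __init__(self, network: nn.Module) -> None:
        super().__init__()
        self.network = network

    def forward(self, observation: torch.Tensor) -> torch.Tensor:
        return self.network(observation)


class CategoricalPolicy(nn.Module):
    """Stochastic policy over a discrete action space."""

    def __init__(self, network: nn.Module) -> None:
        super().__init__()
        self.network = network

    def forward(self, observation: torch.Tensor) -> Categorical:
        return Categorical(logits=self.network(observation))


class GaussianPolicy(nn.Module):
    """Stochastic policy over a continuous action space."""

    def __init__(self, network: nn.Module, action_size: int) -> None:
        super().__init__()
        self.network = network
        self.log_std = nn.Parameter(torch.zeros(action_size))

    def forward(self, observation: torch.Tensor) -> Normal:
        return Normal(self.network(observation), self.log_std.exp())

Keeping the distribution logic inside each policy class keeps the algorithm code agnostic to the action space, which is what DDPG and continuous-action support need.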

Implement PPO

Implement Proximal Policy Optimization (PPO) (by clipping)
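A hedged sketch of the clipped surrogate objective at the heart of PPO-Clip, not the project's eventual implementation; all tensors are placeholders collected elsewhere.

import torch

def ppo_clip_loss(
    log_probs: torch.Tensor,
    old_log_probs: torch.Tensor,
    advantages: torch.Tensor,
    clip_range: float = 0.2,
) -> torch.Tensor:
    # Probability ratio between the current and the data-collecting policy.
    ratio = torch.exp(log_probs - old_log_probs)
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range)
    # Maximize the clipped surrogate objective, i.e. minimize its negation.
    return -torch.min(ratio * advantages, clipped_ratio * advantages).mean()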

Set up mypy

  • Set up mypy for type checking.
  • Resolve all errors and warnings from mypy.
  • Remove unnecessary type hints.
