quantumiracle / popular-rl-algorithms

PyTorch implementation of Soft Actor-Critic (SAC), Twin Delayed DDPG (TD3), Actor-Critic (AC/A2C), Proximal Policy Optimization (PPO), QT-Opt, PointNet..

License: Apache License 2.0

Python 38.01% Jupyter Notebook 61.99%
reinforcement-learning soft-actor-critic state-of-the-art

popular-rl-algorithms's Introduction

Popular Model-free Reinforcement Learning Algorithms

PyTorch and TensorFlow 2.0 implementations of state-of-the-art model-free reinforcement learning algorithms, on both OpenAI Gym environments and a self-implemented Reacher environment.

Algorithms include:

  • Actor-Critic (AC/A2C);
  • Soft Actor-Critic (SAC);
  • Deep Deterministic Policy Gradient (DDPG);
  • Twin Delayed DDPG (TD3);
  • Proximal Policy Optimization (PPO);
  • QT-Opt (including Cross-entropy (CE) Method);
  • PointNet;
  • Transporter;
  • Recurrent Policy Gradient;
  • Soft Decision Tree;
  • Probabilistic Mixture-of-Experts;
  • QMIX
  • etc.

Please note that this repo is more a personal collection of algorithms I implemented and tested during my research and study, rather than an official open-source library/package for general usage. However, I think it could be helpful to share it with others, and I welcome discussions about my implementations. I did not spend much time cleaning or structuring the code, and you may notice several versions of the implementation for some algorithms; I intentionally keep all of them here so you can refer to and compare them. Also, this repo contains PyTorch implementations only.

For official libraries of RL algorithms, I provide the following two, built with TensorFlow 2.0 + TensorLayer 2.0:

  • RL Tutorial (Status: Released) contains RL algorithm implementations as tutorials with simple structures.

  • RLzoo (Status: Released) is a baseline implementation with a high-level API supporting a variety of popular environments, with a more hierarchical structure for simple usage.

For multi-agent RL, a new repository has been built (PyTorch):

  • MARS (Status: WIP) is a library for multi-agent RL on games, like PettingZoo Atari, SlimeVolleyBall, etc.

Since TensorFlow 2.0 has adopted dynamic graph construction instead of the static graph, it becomes trivial to transfer RL code between TensorFlow and PyTorch.


Usage:

python ***.py --train

python ***.py --test
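
For example, to train and then test the LSTM-based SAC implementation (using one of the script names that appear later in this README):

    python sac_v2_lstm.py --train
    python sac_v2_lstm.py --test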

Troubleshooting:

If you meet the problem "NotImplementedError", it may be due to a wrong gym version. The newest gym==0.14 won't work. Install gym==0.7 or gym==0.10 with pip install -r requirements.txt.

Undervalued Tricks:

As is well known, there are various tricks in empirical RL algorithm implementations that support performance in practice, including hyper-parameters, normalization, network architecture, and even the hidden activation function. Here I summarize some that I encountered with the programs in this repo:

  • Environment specific:

    • For the Pendulum-v0 environment in Gym, a reward pre-processing as (r+8)/8 usually improves learning efficiency, as done here. Also, this environment needs the maximum episode length to be at least 150 to learn well; episodes that are too short make it hard to learn. A minimal sketch of this pre-processing follows these environment notes.
    • The MountainCar-v0 environment in Gym has a very sparse reward (given only when reaching the flag), so learning curves will generally be noisy; some environment-specific processing may also be needed here.
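
      A minimal sketch of the Pendulum reward pre-processing and episode-length setting above, written as a Gym wrapper (the wrapper class is hypothetical, for illustration only; the scripts in this repo may apply the same transform inline instead):

      import gym

      class PendulumRewardWrapper(gym.RewardWrapper):
          # Pendulum-v0 rewards lie roughly in [-16.3, 0]; (r + 8) / 8 rescales them to roughly [-1, 1].
          def reward(self, reward):
              return (reward + 8.0) / 8.0

      env = gym.wrappers.TimeLimit(
          PendulumRewardWrapper(gym.make('Pendulum-v0').unwrapped),
          max_episode_steps=150,  # at least 150 steps per episode, as suggested above
      )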
  • Normalization:

    • Reward normalization or advantage normalization within a batch can sometimes greatly improve performance (learning efficiency, stability), even though, in theory, on-policy algorithms like PPO should not apply data normalization during training due to distribution shift. To treat this problem properly, we should distinguish two cases: (1) normalizing the direct input data, such as observations, actions and rewards; and (2) normalizing value estimates (state value, state-action value, advantage, etc.). For (1), a more reasonable approach is to keep a moving average of the previous mean and standard deviation, which approximates normalizing over the full dataset during RL agent learning (doing it exactly is not possible, since in RL the data comes from the agent's interaction with the environment). For (2), we can simply normalize the value estimates within each batch (rather than keeping a historical average), since we do not want the estimated values to have distribution shift, so we treat them as a static distribution. A minimal sketch of both cases follows this item.
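
      A minimal sketch of both cases, assuming NumPy is available (the class and function names are hypothetical, for illustration only):

      import numpy as np

      class RunningNormalizer:
          # Case (1): keep running mean/std over all data seen so far (parallel-variance update).
          def __init__(self, eps=1e-8):
              self.mean, self.var, self.count = 0.0, 1.0, eps
              self.eps = eps

          def update(self, x):
              x = np.asarray(x, dtype=np.float64)
              batch_mean, batch_var, batch_count = x.mean(axis=0), x.var(axis=0), x.shape[0]
              delta = batch_mean - self.mean
              total = self.count + batch_count
              new_mean = self.mean + delta * batch_count / total
              m2 = (self.var * self.count + batch_var * batch_count
                    + delta ** 2 * self.count * batch_count / total)
              self.mean, self.var, self.count = new_mean, m2 / total, total

          def normalize(self, x):
              return (x - self.mean) / (np.sqrt(self.var) + self.eps)

      def normalize_batch(values, eps=1e-8):
          # Case (2): normalize value estimates (e.g. advantages) within the current batch only.
          values = np.asarray(values, dtype=np.float64)
          return (values - values.mean()) / (values.std() + eps)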
  • Multiprocessing:

    • Is a multiprocessing update based on torch.multiprocessing the right/safe way to parallelize the code? The official instruction (the Hogwild example) applies torch.multiprocessing without any explicit locks, which means it can be potentially unsafe when multiple processes generate gradients and update the shared model at the same time. See more discussions here and some tests and answers. In general, the drawback of unsafe updates may be outweighed by the speed-up from multiprocessing (RL training itself also has huge variance and noise).

    • Although I provide multiprocessing versions of several algorithms (SAC, PPO, etc.), for small-scale environments in Gym this is usually unnecessary or even inefficient. A vectorized environment wrapper for parallel environment sampling may be a more appropriate solution for learning these environments, since the bottleneck in learning efficiency mainly lies in the interaction with the environments rather than in the model learning (back-propagation) process. A minimal sketch of vectorized sampling follows this item.
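
      A minimal sketch of vectorized sampling with Gym's built-in vector API (available in newer Gym releases, not in the old gym==0.10 mentioned in the troubleshooting section; this is an alternative to, not part of, the multiprocessing scripts in this repo):

      import gym

      env_fns = [lambda: gym.make('Pendulum-v0') for _ in range(8)]
      envs = gym.vector.AsyncVectorEnv(env_fns)   # one subprocess per environment

      obs = envs.reset()                          # batched observations, shape (8, obs_dim)
      for _ in range(100):
          actions = envs.action_space.sample()    # batched random actions as a stand-in for the policy
          # old-style Gym step API (4-tuple) assumed; finished sub-environments auto-reset
          obs, rewards, dones, infos = envs.step(actions)
      envs.close()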

    • A quick note on multiprocessing usage:

      Sharing a class instance together with its state across multiple processes requires putting the instance inside a multiprocessing manager:
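
      A minimal sketch using multiprocessing.managers.BaseManager (the Counter class here is hypothetical, standing in for any stateful object such as a replay buffer):

      import multiprocessing as mp
      from multiprocessing.managers import BaseManager

      class Counter:
          # stand-in for a stateful class whose single instance should be shared across processes
          def __init__(self):
              self.n = 0
          def add(self, k):
              self.n += k
          def get(self):
              return self.n

      def worker(counter):
          for _ in range(1000):
              counter.add(1)   # each call is forwarded to the one instance living in the manager process

      if __name__ == '__main__':
          BaseManager.register('Counter', Counter)
          manager = BaseManager()
          manager.start()
          counter = manager.Counter()     # proxy object; the real instance lives in the manager process
          procs = [mp.Process(target=worker, args=(counter,)) for _ in range(4)]
          for p in procs: p.start()
          for p in procs: p.join()
          print(counter.get())            # 4000: state accumulated by all workers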

  • PPO Details:

    • Here I summarize a list of implementation details for the PPO algorithm on continuous action spaces, corresponding to the scripts ppo_gae_continuous.py, ppo_gae_continuous2.py and ppo_gae_continuous3.py. A brief sketch of the clipped objective and GAE computation these scripts revolve around follows this list.
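
      A minimal sketch of the two core pieces, the clipped surrogate loss and Generalized Advantage Estimation, written independently of the scripts above (details such as the value loss and entropy bonus are omitted):

      import torch

      def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
          # clipped surrogate objective; all arguments are 1-D tensors over a batch
          ratio = torch.exp(log_probs_new - log_probs_old)   # pi_new(a|s) / pi_old(a|s)
          unclipped = ratio * advantages
          clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
          return -torch.min(unclipped, clipped).mean()       # maximize surrogate -> minimize its negative

      def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
          # values has length len(rewards) + 1; the last entry bootstraps the value of the final state
          advantages, running = [], 0.0
          for t in reversed(range(len(rewards))):
              mask = 0.0 if dones[t] else 1.0
              delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
              running = delta + gamma * lam * mask * running
              advantages.insert(0, running)
          return advantages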

For more discussion of implementation tricks, see this chapter in our book.

Performance:

  • SAC for gym Pendulum-v0:

SAC with automatically tuned entropy temperature alpha:

SAC without automatically tuned entropy temperature alpha:

It shows that the automatic entropy tuning helps the agent learn faster.

  • TD3 for gym Pendulum-v0:

TD3 with deterministic policy:

TD3 with non-deterministic/stochastic policy:

It seems TD3 with a deterministic policy works a little better, though the results are broadly similar.

  • AC for gym CartPole-v0:

However, vanilla AC/A2C cannot handle continuous-control environments like gym Pendulum-v0 well.

  • PPO for gym LunarLanderContinuous-v2:

Use ppo_continuous_multiprocess2.py.

Citation:

To cite this repository:

@misc{rlalgorithms,
  author = {Zihan Ding},
  title = {Popular-RL-Algorithms},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/quantumiracle/Popular-RL-Algorithms}},
}

Other Resources:

Deep Reinforcement Learning: Fundamentals, Research, and Applications, Springer Nature 2020

is the book I edited with Dr. Hao Dong and Dr. Shanghang Zhang, which provides wide coverage of topics in deep reinforcement learning. For details, see the website and the Springer webpage. To cite the book:

@book{deepRL-2020,
 title={Deep Reinforcement Learning: Fundamentals, Research, and Applications},
 editor={Hao Dong, Zihan Ding, Shanghang Zhang},
 author={Hao Dong, Zihan Ding, Shanghang Zhang, Hang Yuan, Hongming Zhang, Jingqing Zhang, Yanhua Huang, Tianyang Yu, Huaqing Zhang, Ruitong Huang},
 publisher={Springer Nature},
 note={\url{http://www.deepreinforcementlearningbook.org}},
 year={2020}
}

popular-rl-algorithms's People

Contributors

dependabot[bot] · jieren98 · quantumiracle


popular-rl-algorithms's Issues

“expected sequence of length 4 at dim 1” when running dqn.py

Hello, I encountered the following error while running dqn.py. It seems that the error is due to the state s not being correctly initialized as a valid sequence containing four elements. I attempted to modify the variable related to the initialization of s, "self.current_frame_idx=0", to "self.current_frame_idx =[0, 0, 0, 0]". However, I still receive the same error. Could you please advise me on the correct way to modify the code?

UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\utils\tensor_new.cpp:278.)
  x = torch.unsqueeze(torch.FloatTensor(x), 0).to(device)
Traceback (most recent call last):
  File "D:\wush-group\PythonProject\Popular-RL-Algorithms-master\dqn.py", line 269, in <module>
    rollout(env, model)
  File "D:\wush-group\PythonProject\Popular-RL-Algorithms-master\dqn.py", line 246, in rollout
    a = model.choose_action(s)
  File "D:\wush-group\PythonProject\Popular-RL-Algorithms-master\dqn.py", line 160, in choose_action
    x = torch.unsqueeze(torch.FloatTensor(x), 0).to(device)
ValueError: expected sequence of length 4 at dim 1 (got 0)

Variable length episodes

I tried to use your code on a custom environment that can have variable-length episodes. Maybe I have set it up wrong, but I can't figure out how it can work. The replay buffer is filled from complete episodes (not just state transitions), but in update in td3_lstm it samples those episodes and uses torch.FloatTensor(state).to(device) to put them on the GPU. This won't work, as the batch can have varying-length episodes and PyTorch won't allow this. It possibly works with a batch size of 1.

I think "with torch.no_grad():" is needed when calculating critic loss

In file sac_v2_lstm.py, the code below is missing with torch.no_grad(): when calculating the target Q value:

    # Training Q Function
    #  I think `with torch.no_grad(): ` is needed
        predict_target_q1, _ = self.target_soft_q_net1(next_state, new_next_action, action, hidden_out)
        predict_target_q2, _ = self.target_soft_q_net2(next_state, new_next_action, action, hidden_out)
        target_q_min = torch.min(predict_target_q1, predict_target_q2) - self.alpha * next_log_prob
        target_q_value = reward + (1 - done) * gamma * target_q_min # if done==1, only reward

        q_value_loss1 = self.soft_q_criterion1(predicted_q_value1, target_q_value.detach())  # detach: no gradients for the variable
        q_value_loss2 = self.soft_q_criterion2(predicted_q_value2, target_q_value.detach())

Though for some simple environments the algorithm will eventually converge, I am not sure about the case of complex environments. As far as I know, the parameters used for calculating target_q_value should not contribute any gradient to the model.
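
A minimal sketch of the suggested change, reusing the variable names from the snippet above (a fragment of that update step, not a standalone script): wrap the target computation in torch.no_grad() so no graph is built through the target networks.

    # Training Q Function
    with torch.no_grad():   # no gradients through the target networks or the target value
        predict_target_q1, _ = self.target_soft_q_net1(next_state, new_next_action, action, hidden_out)
        predict_target_q2, _ = self.target_soft_q_net2(next_state, new_next_action, action, hidden_out)
        target_q_min = torch.min(predict_target_q1, predict_target_q2) - self.alpha * next_log_prob
        target_q_value = reward + (1 - done) * gamma * target_q_min  # if done==1, only reward

    q_value_loss1 = self.soft_q_criterion1(predicted_q_value1, target_q_value)  # .detach() no longer needed
    q_value_loss2 = self.soft_q_criterion2(predicted_q_value2, target_q_value)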

NameError: name 'last_action' is not defined

Hi,

After training SAC-LSTM by running python3 sac_v2_lstm.py --train, I then attempted to test. Upon testing, I received an error stating "name 'last_action' is not defined". More specifically:

gym/envs/registration.py:14: PkgResourcesDeprecationWarning: Parameters to load are deprecated.  Call .resolve and .require separately.
  result = entry_point.load(False)
Soft Q Network (1,2):  QNetworkLSTM(
  (linear1): Linear(in_features=4, out_features=512, bias=True)
  (linear2): Linear(in_features=4, out_features=512, bias=True)
  (lstm1): LSTM(512, 512)
  (linear3): Linear(in_features=1024, out_features=512, bias=True)
  (linear4): Linear(in_features=512, out_features=1, bias=True)
)
Policy Network:  SAC_PolicyNetworkLSTM(
  (linear1): Linear(in_features=3, out_features=512, bias=True)
  (linear2): Linear(in_features=4, out_features=512, bias=True)
  (lstm1): LSTM(512, 512)
  (linear3): Linear(in_features=1024, out_features=512, bias=True)
  (linear4): Linear(in_features=512, out_features=512, bias=True)
  (mean_linear): Linear(in_features=512, out_features=1, bias=True)
  (log_std_linear): Linear(in_features=512, out_features=1, bias=True)
)
Traceback (most recent call last):
  File "sac_v2_lstm.py", line 308, in <module>
    action, hidden_out = sac_trainer.policy_net.get_action(state, last_action, hidden_in, deterministic = DETERMINISTIC)
NameError: name 'last_action' is not defined

For now I added last_action = env.action_space.sample() at line 302 after the else statement; however, I am unsure if this is correct.

Lastly, I had a question about the "NormalizedActions" class at line 50 of sac_v2_lstm.py. How is this useful if the action space is, for example, [-1, +1] but the normalisation scales to [0, 1]? Is there any reference text for this motivation?

Thanks

Stochastic Action sample seems not right to me

In "SOTA-RL-Algorithms/common/policy_networks.py" line 337, the random sampling from the normal distribution is done by z = normal.sample().
While this samples a single value, it works for a single action space. For multiple action space, this does not seems right. What about z = normal.sample_n(self._action_dim) ?

ValueError on SAC v2 LSTM

I am getting these errors. Any idea why?

    done = T.tensor(np.float32(done)).unsqueeze(-1).to(self.policy_net.device)
ValueError: setting an array element with a sequence.
    reward = T.tensor(reward, dtype=T.float).unsqueeze(-1).to(self.policy_net.device)
ValueError: expected sequence of length 23 at dim 1 (got 48)

Does sac_v2_lstm support Pendulum-v0?

I am trying to run sac_v2_lstm.py with python sac_v2_lstm.py --train.

I changed 'ENV = ['Reacher', 'Pendulum-v0', 'HalfCheetah-v2'][2]' in the original code to 'ENV = ['Reacher', 'Pendulum-v0', 'HalfCheetah-v2'][1]'.

The error I got:

    action | [0.20995763]
    Traceback (most recent call last):
      File "sac_v2_lstm.py", line 267, in <module>
        next_state, reward, done, _ = env.step(action)
      File "/usr/local/anaconda3/lib/python3.8/site-packages/gym/core.py", line 292, in step
        return self.env.step(self.action(action))
      File "/usr/local/anaconda3/lib/python3.8/site-packages/gym/core.py", line 295, in action
        raise NotImplementedError
    NotImplementedError

Is it because for the POMDP we need a particular gym env?

Error: ppo_gae_discrete.py

samples_2d = torch.multinomial(probs_2d, sample_shape.numel(), True).T
RuntimeError: invalid multinomial distribution (encountering probability entry < 0)

Issue in test mode of 'sac_v2_gru.py'

Hi, thank you for your awesome GRU-based SAC!
I found that lines 303 to 304 in 'sac_v2_gru.py' should be modified, since it is a GRU implementation.

Location :
https://github.com/quantumiracle/SOTA-RL-Algorithms/blob/9856600d19f2ed787094f7a968e8588cfead1a21/sac_v2_gru.py#L303

Modification:

        # hidden_out = (torch.zeros([1, 1, hidden_dim], dtype=torch.float).cuda(), \
        #     torch.zeros([1, 1, hidden_dim], dtype=torch.float).cuda())  # initialize hidden state for lstm, (hidden, cell), each is (layer, batch, dim)             
        hidden_out = torch.zeros([1, 1, hidden_dim], dtype=torch.float).cuda()

Thanks!

Missing folders

Could you push the folders for gym-pomdp-wrappers and gym_pomdp?

Random actions at the beginning for recurrent policy

Hello,

I wonder what you would implement to achieve the above. It is easy for an MDP policy, but for recurrent policies, how can we get hidden_out if we take a random action? For instance, if we take the action from the policy_net then we can get hidden_out, but what if we take a random action instead?
action, hidden_out = sac_trainer.policy_net.get_action(obs, last_action, hidden_in, deterministic = environment["deterministic_train"])

About RL+LSTM

Hello, the Markov property means that the current action of the agent depends only on the current state s_t, but the input of the policy network in your open-source RL+LSTM code is s_t and a_{t-1}; does this type of algorithm converge during training?
Thank you again for your open-source code.
