khrylx / pytorch-rl

PyTorch implementation of Deep Reinforcement Learning: Policy Gradient methods (TRPO, PPO, A2C) and Generative Adversarial Imitation Learning (GAIL). Fast Fisher vector product TRPO.

License: MIT License

Python 100.00%
reinforcement-learning policy-gradient pytorch-rl proximal-policy-optimization trpo ppo pytorch a2c generative-adversarial-network fisher-vectors

pytorch-rl's Introduction

[Ye Yuan's GitHub stats card]

pytorch-rl's People

Contributors

khrylx


pytorch-rl's Issues

Memory leak during GPU training

Hi,

I'm not sure if this is a PyTorch-specific issue, but during training my GPU memory usage increases enormously. For my experiment I'm using a large number of samples to optimize my discriminator and policy (~4,000 samples, where the state is a 51-dimensional vector and the action is a 2-dimensional vector). The network trains fine at first but runs into an out-of-memory exception after a few epochs. This happens for both TRPO and PPO.

I can reproduce this issue by using your gail.gail_gym script and changing the argument min-batch-size to 100,000 (the problem probably exists with smaller batch sizes as well, but this makes the increase more obvious).

The memory consumption starts at 900 MB after the first epoch (which seems reasonable) but increases to 1,300 MB after the third and to 1,700 MB after epoch ~160. Note that the increase is not constant but happens all at once after an arbitrary number of epochs.

I tried to del variables after use, but this does not solve it.

I'm using CUDA 7.5 and PyTorch 0.3.0.

Thank you very much!
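A minimal mitigation sketch, offered as an assumption about the cause rather than a confirmed fix for this repository: tensors stored during rollouts should not carry autograd graphs, and cached GPU blocks can be released between epochs. Here policy_net.select_action, memory.push, and the surrounding variables are illustrative names for the sampling step.

    import torch

    # sample rollout actions without building an autograd graph
    with torch.no_grad():
        action = policy_net.select_action(state_tensor)
    memory.push(state, action.cpu().numpy(), mask, next_state, reward)

    torch.cuda.empty_cache()  # optionally return cached blocks to the driver between epochs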

Question on multiprocessing

Hi @Khrylx, thank you so much for this great repository!
But I found in the Readme:

PyTorch will create additional threads when performing computations which can damage the performance of multiprocessing. This problem is most serious with Linux, where multiprocessing can be even slower than a single thread

and found here that you only set one process.
I'm a little confused: do you recommend using multiple processes to train the network?
And if only one process is used for training, what is its advantage over a simple memory buffer?
Thank you!
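For reference, a minimal sketch of the usual workaround the README hints at (hedged; the exact settings are an assumption): keep PyTorch single-threaded inside each sampling process so the worker processes do not compete for CPU threads.

    import os
    import torch

    # limit intra-op threading before any heavy computation happens in a worker process
    os.environ['OMP_NUM_THREADS'] = '1'
    torch.set_num_threads(1)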

Mountain Car

Thanks for open sourcing this, this is very good stuff.

However, the code doesn't seem to work on the MountainCar env.
Could it be because I only have 2600 expert state/action pairs?

Entropy Term for GAIL

In the paper https://arxiv.org/pdf/1606.03476.pdf (Jonathan Ho's GAIL paper), there is a causal entropy term in the policy network update step. I could not find anything related to that entropy term anywhere in your code. Did you skip it? Wouldn't that change the objective function? The whole derivation depended on Maximum Causal Entropy IRL, which in the end just became a kind of regularizer.

I think that if the causal entropy term is left out, then instead of maximum-entropy IRL you are implicitly doing something else, which may be wrong or suboptimal. Correct me if I'm wrong.
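For illustration, a hedged sketch of what adding the causal-entropy bonus to the policy objective could look like (lambda_entropy, policy_dist, and the surrounding names are assumptions, not this repository's API):

    import torch

    # surrogate policy objective with an entropy regularizer lambda_entropy * H(pi)
    entropy = policy_dist.entropy().mean()            # assumes a torch.distributions policy
    ratio = torch.exp(log_probs - fixed_log_probs)
    policy_loss = -(ratio * advantages).mean() - lambda_entropy * entropy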

About the computation of Advantage and State Value in PPO

In your implementation of the critic, you feed the network the observation and the action and it outputs a 1-dim value. Can I infer that it is Q(s,a)?
But the advantage you compute is
values = self.critic_target(states_var, actions_var).detach()
advantages = rewards_var - values
which is the estimate q_t minus Q(s_t, a_t).
I think it should be Advantage = q_t - V(s_t).
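For comparison, a minimal sketch of the standard state-value formulation the comment argues for (value_net, states, and returns are illustrative names):

    values = value_net(states)                 # V(s_t), a state-value critic
    advantages = returns - values.detach()     # A_t ≈ q_t - V(s_t)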

CudnnRNN is not differentiable twice

Hi,

First of all thank you very much for providing this great repository!

I am currently implementing GAIL using an MLP as well as an RNN policy net (two different experiments). The MLP network works as intended, but if I switch to the RNN policy I get a RuntimeError: CudnnRNN is not differentiable twice during execution of this line in core.trpo.Fvp_fim:

Jv = torch.autograd.grad(Jtv, t, retain_graph=True)[0]

The only difference between my MLP and RNN policy implementation is the initialization of the hidden state during get_log_prob and get_fim within my policy class.

Given your recent commit (d66765eecad38ddc3f6e0f33d35ef70a7ed11892) I thought that the network is only differentiated once during TRPO.

Am I doing something wrong, or is the network still being differentiated twice?

Thank you very much!
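One possible workaround, offered as an assumption rather than a confirmed fix: cuDNN's RNN kernels do not support double backward, so cuDNN can be disabled just for the pass that needs second derivatives.

    import torch

    # fall back to the native (double-differentiable) RNN implementation for this call
    with torch.backends.cudnn.flags(enabled=False):
        Jv = torch.autograd.grad(Jtv, t, retain_graph=True)[0]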

Various questions?

Hi,

Thanks a lot for this extremely useful implementation.

I just wanted to ask: what is the ZFilter class? Is it used to standardize the observed state according to the running mean and std of the observed states?
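For reference, a minimal sketch of what such a running normalizer typically does (illustrative only, not the repository's exact ZFilter implementation):

    import numpy as np

    class RunningZFilter:
        """Keep a running mean/std (Welford update) and return the clipped, standardized input."""
        def __init__(self, shape, clip=10.0, eps=1e-8):
            self.n = 0
            self.mean = np.zeros(shape)
            self.m2 = np.zeros(shape)          # running sum of squared deviations
            self.clip, self.eps = clip, eps

        def __call__(self, x):
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)
            std = np.sqrt(self.m2 / max(self.n - 1, 1)) + self.eps
            return np.clip((x - self.mean) / std, -self.clip, self.clip)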

In addition, in the GAIL paper they consider, in the TRPO update, a step in the direction of the gradient of the entropy. Is that considered here? I cannot find it in the code.

Thank you in advance.

Luca

Concatenation of memories with a non-terminated episode

mask = 0 if done else 1

Hi,

Thank you for your code; it is really well written! From my understanding, mask is 0 at the end of an episode and 1 otherwise. But there will be a problem if you concatenate a memory M1 (whose last episode is not terminated) with a memory M2, because after concatenation the computation of returns will be wrong.

To correct this, I think mask should be 0 at the end of an episode or when num_steps = min_batch_size - 1.
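To make the concern concrete, a sketch of the mask-based return recursion in question (variable names are illustrative, not the repository's exact code):

    returns = [0.0] * len(rewards)
    prev_return = 0.0
    for i in reversed(range(len(rewards))):
        returns[i] = rewards[i] + gamma * prev_return * masks[i]   # masks[i] == 0 cuts the episode
        prev_return = returns[i]
    # After concatenating M1 and M2, if M1's truncated last episode ends with mask == 1,
    # the return of M2's first step leaks backwards into M1's episode.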

Lucas

Autograd Import Error

Hi @Khrylx !
I'm trying your code out of the box by running python examples/ppo_gym.py --env-name Hopper-v1
I get the following error -

Traceback (most recent call last):
  File "examples/ppo_gym.py", line 9, in <module>
    from utils import *
  File "/home/aseem/treeqn_atreec/PyTorch-RL/utils/__init__.py", line 3, in <module>
    from utils.torch import *
  File "/home/aseem/treeqn_atreec/PyTorch-RL/utils/torch.py", line 3, in <module>
    from torch.autograd import Variable
ImportError: No module named autograd

Could you help me out with it? I suspect there is an issue with setting up the PYTHONPATH variable correctly.
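One hedged workaround (an assumption about the cause, not a confirmed repository fix): make sure the repository root is on the module search path before the local utils package is imported, either by exporting PYTHONPATH to the repo root or with something like the following at the top of examples/ppo_gym.py:

    import os
    import sys

    # prepend the repository root so that `utils` and its submodules resolve correctly
    sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))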

TRPO: KL Divergence Computation

I see how KL divergence is computed here:
    def get_kl(self, x):
        action_prob1 = self.forward(x)
        action_prob0 = action_prob1.detach()
        kl = action_prob0 * (torch.log(action_prob0) - torch.log(action_prob1))
        return kl.sum(1, keepdim=True)

Isn't this wrong? Shouldn't the KL divergence be computed between the new policy and the old policy? Right now action_prob1 and action_prob0 are the same, so the KL divergence will always be zero, won't it?

Implementation problem

Hi Dr Yuan, firstly, thank you for your great work!
However, when I run PPO, the multiprocessing has a problem on my desktop. I'm using Python 3.6 with PyTorch 1.7.1 on Ubuntu 20.04 with CartPole-v0. In the first loop over i_iter in main_loop in ppo_gym.py, when evaluating the trained model, the code gets stuck.
After testing, I found that it is because in process 4 (one of the processes created here), the action cannot be computed, which is weird. I didn't change anything, and I don't think multiprocessing should affect computing with the network. Do you have any idea why this happens?

Why does GAIL get lower rewards the more it is trained?

Hi, thank you for the baseline code, it helps me a lot. But I have a small problem running it. I first sample data with the trained expert policy and then provide it to GAIL, but in the Ant-v2 and Hopper-v2 environments the rewards get lower and lower as training continues. My environment is mujoco-py 2.0.8 and mujoco200. I would be very grateful if you could take the time to look into the problem for me.
[screenshots of the declining reward curves]

Inconsistent action shape when running CartPole-v1

Gym env: CartPole-v1
Affected code
File: gail_gym.py

    """update discriminator"""
    for _ in range(1):
        expert_state_actions = torch.from_numpy(expert_traj).to(dtype).to(device)

        g_o = discrim_net(torch.cat([states, actions], 1))
        e_o = discrim_net(expert_state_actions)
        optimizer_discrim.zero_grad()
        discrim_loss = discrim_criterion(g_o, ones((states.shape[0], 1), device=device)) + \
            discrim_criterion(e_o, zeros((expert_traj.shape[0], 1), device=device))
        discrim_loss.backward()
        optimizer_discrim.step()

Error

 g_o = discrim_net(torch.cat([states, actions], 1))
 RuntimeError: invalid argument 0: Tensors must have same number of dimensions: got 2 and 1 at /opt/conda/conda-bld/pytorch-cpu_1544218188686/work/aten/src/TH/generic/THTensorMoreMath.cpp:1324

To reproduce this error:
python ./gail/save_expert_traj.py --model-path assets/learned_models/CartPole-v1_trpo.p --env-name CartPole-v1 --save-model-interval 100
python ./gail/gail_gym.py --env-name CartPole-v1 --expert-traj-path assets/expert_traj/CartPole-v1_expert_traj.p

This happened because CartPole-v1's action is discrete, hence:
state = [[0.1,0.2,0.3,0.4],[0.1,0.2,0.3,0.4]]
action = [1,0]
so states is 2-D while actions is 1-D, and torch.cat throws this error.
Fix suggestion

  """update discriminator"""
    for _ in range(1):
        expert_state_actions = torch.from_numpy(expert_traj).to(dtype).to(device)

        if len(actions.shape) == 1:
            g_o = discrim_net(torch.cat([states, actions.unsqueeze(1)], 1))
        else:
            g_o = discrim_net(torch.cat([states, actions], 1))
        e_o = discrim_net(expert_state_actions)
        optimizer_discrim.zero_grad()
        discrim_loss = discrim_criterion(g_o, ones((states.shape[0], 1), device=device)) + \
            discrim_criterion(e_o, zeros((expert_traj.shape[0], 1), device=device))
        discrim_loss.backward()
        optimizer_discrim.step()

save_expert_traj doesn't raise this error because np.hstack stacks the elements instead of concatenating them along an existing dimension:
expert_traj.append(np.hstack([state, action]))

Confusion about advantage computation

Hey!
I'm a bit confused about why, in the code that computes advantages, the previous advantage value is being set to the first env's advantage from the previous time step, i.e. advantages[i, 0]
(assuming that advantages have shape (time_steps x num_envs x 1)).

prev_value = values[i, 0]

Could you link the source for the equations for this whole function?
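The recursion in question matches Generalized Advantage Estimation (GAE, Schulman et al., arXiv:1506.02438); below is a sketch of the standard update, with illustrative names rather than the repository's exact code:

    # gamma: discount factor, tau: the GAE lambda, masks[i] == 0 at episode ends
    advantages = [0.0] * len(rewards)
    prev_value, prev_advantage = 0.0, 0.0
    for i in reversed(range(len(rewards))):
        delta = rewards[i] + gamma * prev_value * masks[i] - values[i]
        advantages[i] = delta + gamma * tau * prev_advantage * masks[i]
        prev_value, prev_advantage = values[i], advantages[i]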
Thanks!
Gunshi

A question about the PPO implementation

Hi Dr. Yuan,

I successfully applied D3QN to a robotics navigation task using images. However, when I use your PPO implementation on this task, it doesn't seem to work with hyper-parameters similar to D3QN's (learning rate, gamma, network structure, etc.).

Actually, your implementation is great and has shown amazing performance on multiple other tasks. And for this navigation task, the environment settings are the same as when using D3QN. Do you have any suggestions for this situation where PPO doesn't work on this specific task? Which hyper-parameters and factors should I pay attention to?

I'm so sorry for this inconvenience and thank you for your time!

Example for Continued PPO training after GAIL?

Thank you so much for sharing these PyTorch implementations with the community. I was curious whether you have an example of continuing PPO training from a model saved by the GAIL process?
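Not an official example, but a hypothetical sketch of one way to do it (the pickle path and the tuple layout are assumptions, not the repository's documented save format):

    import pickle

    # load the networks saved at the end of a GAIL run and reuse them to seed PPO training
    policy_net, value_net = pickle.load(open('assets/learned_models/Hopper-v2_gail.p', 'rb'))[:2]
    # ...then construct the PPO agent/optimizers around these nets and run the usual main_loop()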

CNN Policy

Can you please add an example of a CNN policy? All the code is oriented towards MLP policies.
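In the meantime, a minimal sketch of what a CNN policy head could look like (an illustration, not code shipped with this repository):

    import torch
    import torch.nn as nn

    class CNNPolicy(nn.Module):
        """Gaussian policy over image observations; conv trunk + linear mean head."""
        def __init__(self, in_channels, action_dim):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
                nn.Flatten(),
            )
            self.action_mean = nn.LazyLinear(action_dim)  # lazily sized to the flattened features
            self.action_log_std = nn.Parameter(torch.zeros(1, action_dim))

        def forward(self, x):
            features = self.conv(x)
            action_mean = self.action_mean(features)
            return action_mean, self.action_log_std.expand_as(action_mean)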

Doubt regarding the calculation of advantage

Hey, thanks for this great repository! I am a beginner in RL and I am trying to understand the practical implementation of TRPO.
What is the purpose of multiplying by the variable 'masks' while computing the advantage (in the estimate_advantage() function)?
And what range of values do 'masks' take?

Thanks!

question about weight init

Hi,
Your implementation is great and easy to read.
I just had one question though, from the line:

self.action_mean.weight.data.mul_(0.1)

Is there any particular reason why the weights are initialized like that (instead of the usual Gaussian/Xavier scheme) with that specific scale?
Thanks!
Gunshi

Failed on " HopperEnv' object has no attribute 'seed' "

Traceback (most recent call last):
  File "/home/zjt/exp/PyTorch-RL/examples/ppo_gym.py", line 70, in <module>
    env.seed(args.seed)
  File "/home/zjt/anaconda3/envs/mujoco/lib/python3.9/site-packages/gym/core.py", line 241, in __getattr__
    return getattr(self.env, name)
  File "/home/zjt/anaconda3/envs/mujoco/lib/python3.9/site-packages/gym/core.py", line 241, in __getattr__
    return getattr(self.env, name)
  File "/home/zjt/anaconda3/envs/mujoco/lib/python3.9/site-packages/gym/core.py", line 241, in __getattr__
    return getattr(self.env, name)
AttributeError: 'HopperEnv' object has no attribute 'seed'

When "python examples/ppo_gym.py --env-name Hopper-v2", the program encountered an error with env.seed(args.seed), and even after commenting out this line, it ran into another error related to 'seed': AttributeError: 'numpy.random._generator.Generator' object has no attribute 'seed'.

TRPO: Is fixed_log_probs the same as log_probs?

In TRPO, is fixed_log_probs the same as log_probs? During debugging, the outputs of the two are identical, so is there no difference between pnew and pold?

    with torch.no_grad():
        fixed_log_probs = policy_net.get_log_prob(states, actions)
    """define the loss function for TRPO"""
    def get_loss(volatile=False):
        with torch.set_grad_enabled(not volatile):
            log_probs = policy_net.get_log_prob(states, actions)
            action_loss = -advantages * torch.exp(log_probs - fixed_log_probs)
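They are identical only at the point where fixed_log_probs is computed; here is a simplified sketch of why they diverge during the update (set_flat_params_to, prev_params, full_step, and loss_before are illustrative names, not the repository's exact line search):

    # fixed_log_probs is frozen under the old parameters, while get_loss() is re-evaluated at
    # trial parameters during the TRPO backtracking line search, so the ratio moves away from 1.
    for step_frac in [0.5 ** k for k in range(10)]:
        set_flat_params_to(policy_net, prev_params + step_frac * full_step)
        if get_loss().item() < loss_before:    # a real line search also checks the KL constraint
            break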

Training a recurrent policy

I am still struggling with the implementation of a recurrent policy. The trick from #1 worked and I can now run my RNN GAIL network. But no matter what I try, the mean reward actually decreases over time.

I am currently using the same ValueNet and Advantage Estimation as in the repository.

Do I have to change something in trpo_step in order to make RNN Policies work?

Thank you so much!

About computing Hessian*vector

Excuse me, in the TRPO code, in the definition of def Fvp_direct(v), what is the input v and how is it obtained? Thanks for your help!
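For context, a sketch of the standard direct Fisher-vector product and where v comes from (hedged, not copied from the repository; policy_net, states, and damping are assumed names): the TRPO natural-gradient step solves F x = g with conjugate gradients, and each CG iteration calls Fvp_direct with its current search direction as v.

    import torch

    def Fvp_direct(v):
        # Hessian-vector product of the mean KL, i.e. the Fisher matrix times v, plus damping
        kl = policy_net.get_kl(states).mean()
        grads = torch.autograd.grad(kl, policy_net.parameters(), create_graph=True)
        flat_grad_kl = torch.cat([grad.view(-1) for grad in grads])
        kl_v = (flat_grad_kl * v).sum()
        grads = torch.autograd.grad(kl_v, policy_net.parameters())
        return torch.cat([grad.contiguous().view(-1) for grad in grads]) + damping * v

    # conjugate_gradients(Fvp_direct, loss_grad, nsteps=10) then supplies v internally at each iteration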

Few Runtime errors

I received runtime errors (invalid value in reduce).
I think it's better to use BCEWithLogitsLoss as the loss criterion for the discriminator; it's numerically stable.
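A sketch of the suggested change (assuming the discriminator currently ends in a sigmoid and the training script uses nn.BCELoss()):

    import torch.nn as nn

    discrim_criterion = nn.BCEWithLogitsLoss()   # numerically stable sigmoid + BCE in one op
    # ...and drop the final torch.sigmoid() from the discriminator's forward pass,
    # since BCEWithLogitsLoss applies the sigmoid internally.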

about the kl

def get_kl(self, x):
    action_prob1 = self.forward(x)
    action_prob0 = action_prob1.detach()
    kl = action_prob0 * (torch.log(action_prob0) - torch.log(action_prob1))
    return kl.sum(1, keepdim=True)

Shouldn't the KL be between two different policies? Here action_prob1 == action_prob0? Thank you

Failure to train GAIL in the Ant-v2 environment

I trained your ppo first.

python examples/ppo_gym.py --env-name Ant-v2 --save-model-interval 100

After 500 episodes, I made trajectories.

python gail/save_expert_traj.py --model-path assets/learned_models/Ant-v2_ppo.p

Last, I ran gail.

python gail/gail_gym.py --env-name Ant-v2 --expert-traj-path assets/expert_traj/Ant-v2_expert_traj.p

I implemented GAIL and VAIL, but failed to train them too (Hopper worked well, though).

Any Ideas?

question about A2C

Did you try training an A2C agent on the Swimmer environment? I was not able to train it. I tested many NN parameters but was unsuccessful.

How are we using rewards in imitation learning?

Hi, these implementations are amazing, thank you for sharing them. I have a question about how, and rather why, we are using rewards in imitation learning.

rewards = torch.from_numpy(np.stack(batch.reward)).to(dtype).to(device)

In the paper they mention that instead of using the rewards to improve the policy, we use the log of the discriminator value (last line before the end of the for loop):
[screenshots of the GAIL algorithm's policy-update step from the paper]

As you can see above, the policy update uses the log of the discriminator. Could you please explain why this term is used instead of the reward?
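For context, GAIL implementations typically replace the environment reward with a learned reward from the discriminator before the policy-gradient update; a hedged sketch of a common choice (not necessarily this repository's exact expression):

    import torch

    with torch.no_grad():
        # surrogate reward from the discriminator; the environment reward from batch.reward
        # is then only used for logging, not for the policy update
        custom_rewards = -torch.log(discrim_net(torch.cat([states, actions], 1)))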
