khrylx / pytorch-rl

PyTorch implementation of Deep Reinforcement Learning: Policy Gradient methods (TRPO, PPO, A2C) and Generative Adversarial Imitation Learning (GAIL). Fast Fisher vector product TRPO.

License: MIT License

Python 100.00%
reinforcement-learning policy-gradient pytorch-rl proximal-policy-optimization trpo ppo pytorch a2c generative-adversarial-network fisher-vectors

pytorch-rl's Introduction

[Ye Yuan's GitHub stats card]

pytorch-rl's People

Contributors

khrylx


pytorch-rl's Issues

Memory leak during GPU training

Hi,

I'm not sure if this is a PyTorch-specific issue, but during training my GPU memory usage increases enormously. For my experiment I'm using a large number of samples to optimize my discriminator and policy (~4,000 samples, where the state is a 51-dimensional vector and the action is a 2-dimensional vector). The network trains fine at first but runs into an out-of-memory exception after a few epochs. This happens for both TRPO and PPO.

I can reproduce this issue by using your gail.gail_gym script and changing the argument min-batch-size to 100,000 (the problem probably exists with smaller batch sizes as well, but this makes the increase more obvious).

The memory consumption starts at 900 MB after the first epoch (which seems reasonable) but increases to 1,300 MB after the third and to 1,700 MB after epoch ~160. Note that the increase is not constant but happens all at once after an arbitrary number of epochs.

I tried to del variables after use, but this does not solve it.

I'm using CUDA 7.5 and PyTorch 0.3.0.

Thank you very much!
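A minimal mitigation sketch, offered as an assumption about the cause rather than a confirmed fix for this repository: tensors stored during rollouts should not carry autograd graphs, and cached GPU blocks can be released between epochs. Here policy_net.select_action, memory.push, and the surrounding variables are illustrative names for the sampling step.

    import torch

    # sample rollout actions without building an autograd graph
    with torch.no_grad():
        action = policy_net.select_action(state_tensor)
    memory.push(state, action.cpu().numpy(), mask, next_state, reward)

    torch.cuda.empty_cache()  # optionally return cached blocks to the driver between epochs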

Question on multiprocessing

Hi @Khrylx, thank you so much for this great repository!
But I found in the Readme:

PyTorch will create additional threads when performing computations which can damage the performance of multiprocessing. This problem is most serious with Linux, where multiprocessing can be even slower than a single thread

and found here that you only set one process.
I'm a little confused: do you recommend using multiple processes to train the network?
And if only one process is used for training, what is its advantage over a simple memory buffer?
Thank you!
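For reference, a minimal sketch of the usual workaround the README hints at (hedged; the exact settings are an assumption): keep PyTorch single-threaded inside each sampling process so the worker processes do not compete for CPU threads.

    import os
    import torch

    # limit intra-op threading before any heavy computation happens in a worker process
    os.environ['OMP_NUM_THREADS'] = '1'
    torch.set_num_threads(1)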

Mountain Car

Thanks for open sourcing this, this is very good stuff.

However, the code doesn't seem to work on the MountainCar env.
Could it be because I only have 2600 expert state/action pairs?

Entropy Term for GAIL

In the paper https://arxiv.org/pdf/1606.03476.pdf (Jonathan Ho's GAIL paper), there is a causal entropy term in the policy network update step. I could not find anything related to that entropy term anywhere in your code. Did you skip it? Wouldn't that change the objective function? The whole derivation depended on Maximum Causal Entropy IRL, which in the end just became a kind of regularizer.

I think that if the causal entropy term is left out, then instead of maximum-entropy IRL you are implicitly doing something else, which may be wrong or suboptimal. Correct me if I'm wrong.
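For illustration, a hedged sketch of what adding the causal-entropy bonus to the policy objective could look like (lambda_entropy, policy_dist, and the surrounding names are assumptions, not this repository's API):

    import torch

    # surrogate policy objective with an entropy regularizer lambda_entropy * H(pi)
    entropy = policy_dist.entropy().mean()            # assumes a torch.distributions policy
    ratio = torch.exp(log_probs - fixed_log_probs)
    policy_loss = -(ratio * advantages).mean() - lambda_entropy * entropy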

About the computation of Advantage and State Value in PPO

In your implementation of the critic, you feed the network the observation and the action and it outputs a 1-dim value. Can I infer that it is Q(s,a)?
But the advantage you compute is
values = self.critic_target(states_var, actions_var).detach()
advantages = rewards_var - values
which is the estimate q_t minus Q(s_t, a_t).
I think it should be Advantage = q_t - V(s_t).
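For comparison, a minimal sketch of the standard state-value formulation the comment argues for (value_net, states, and returns are illustrative names):

    values = value_net(states)                 # V(s_t), a state-value critic
    advantages = returns - values.detach()     # A_t ≈ q_t - V(s_t)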

CudnnRNN is not differentiable twice

Hi,

First of all thank you very much for providing this great repository!

I am currently implementing GAIL using an MLP as well as an RNN policy net (two different experiments). The MLP network works as intended, but if I switch to the RNN policy I get a RuntimeError: CudnnRNN is not differentiable twice during execution of this line in core.trpo.Fvp_fim:

Jv = torch.autograd.grad(Jtv, t, retain_graph=True)[0]

The only difference between my MLP and RNN policy implementation is the initialization of the hidden state during get_log_prob and get_fim within my policy class.

Given your recent commit (d66765eecad38ddc3f6e0f33d35ef70a7ed11892) I thought that the network is only differentiated once during TRPO.

Am I doing something wrong, or is the network still being differentiated twice?

Thank you very much!
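One possible workaround, offered as an assumption rather than a confirmed fix: cuDNN's RNN kernels do not support double backward, so cuDNN can be disabled just for the pass that needs second derivatives.

    import torch

    # fall back to the native (double-differentiable) RNN implementation for this call
    with torch.backends.cudnn.flags(enabled=False):
        Jv = torch.autograd.grad(Jtv, t, retain_graph=True)[0]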

Various questions?

Hi,

Thanks a lot for this extremely useful implementation.

I just wanted to ask: what is the ZFilter class? Is it used to standardize the observed state according to the running mean and std of the observed states?
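For reference, a minimal sketch of what such a running normalizer typically does (illustrative only, not the repository's exact ZFilter implementation):

    import numpy as np

    class RunningZFilter:
        """Keep a running mean/std (Welford update) and return the clipped, standardized input."""
        def __init__(self, shape, clip=10.0, eps=1e-8):
            self.n = 0
            self.mean = np.zeros(shape)
            self.m2 = np.zeros(shape)          # running sum of squared deviations
            self.clip, self.eps = clip, eps

        def __call__(self, x):
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)
            std = np.sqrt(self.m2 / max(self.n - 1, 1)) + self.eps
            return np.clip((x - self.mean) / std, -self.clip, self.clip)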

In addition, in the GAIL paper they consider, in the TRPO update, a step in the direction of the gradient of the entropy. Is that considered here? I cannot find it in the code.

Thank you in advance.

Luca

Concatenation of memories with a non-terminated episode

mask = 0 if done else 1

Hi,

Thank you for your code; it is really well written! From my understanding, mask is 0 at the end of an episode and 1 otherwise. But there will be a problem if you concatenate a memory M1 (whose last episode is not terminated) with a memory M2, because after concatenation the computation of returns will be wrong.

To correct this, I think mask should be 0 at the end of an episode or when num_steps = min_batch_size - 1.
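To make the concern concrete, a sketch of the mask-based return recursion in question (variable names are illustrative, not the repository's exact code):

    returns = [0.0] * len(rewards)
    prev_return = 0.0
    for i in reversed(range(len(rewards))):
        returns[i] = rewards[i] + gamma * prev_return * masks[i]   # masks[i] == 0 cuts the episode
        prev_return = returns[i]
    # After concatenating M1 and M2, if M1's truncated last episode ends with mask == 1,
    # the return of M2's first step leaks backwards into M1's episode.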

Lucas

Autograd Import Error

Hi @Khrylx !
I'm trying your code out of the box by running python examples/ppo_gym.py --env-name Hopper-v1
I get the following error -

Traceback (most recent call last):
  File "examples/ppo_gym.py", line 9, in <module>
    from utils import *
  File "/home/aseem/treeqn_atreec/PyTorch-RL/utils/__init__.py", line 3, in <module>
    from utils.torch import *
  File "/home/aseem/treeqn_atreec/PyTorch-RL/utils/torch.py", line 3, in <module>
    from torch.autograd import Variable
ImportError: No module named autograd

Could you help me out with it? I suspect there is an issue with setting up the PYTHONPATH variable correctly.
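One hedged workaround (an assumption about the cause, not a confirmed repository fix): make sure the repository root is on the module search path before the local utils package is imported, either by exporting PYTHONPATH to the repo root or with something like the following at the top of examples/ppo_gym.py:

    import os
    import sys

    # prepend the repository root so that `utils` and its submodules resolve correctly
    sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))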

TRPO: KL Divergence Computation

I see how KL divergence is computed here:
    def get_kl(self, x):
        action_prob1 = self.forward(x)
        action_prob0 = action_prob1.detach()
        kl = action_prob0 * (torch.log(action_prob0) - torch.log(action_prob1))
        return kl.sum(1, keepdim=True)

Isn't this wrong? Shouldn't the KL divergence be computed between the new policy and the old policy? Right now action_prob1 and action_prob0 are the same, so the KL divergence will always be zero, won't it?

Implementation problem

Hi Dr Yuan, firstly, thank you for your great work!
However, when I run PPO, the multiprocessing has a problem on my desktop. I'm using Python 3.6 with PyTorch 1.7.1 on Ubuntu 20.04 with CartPole-v0. In the first loop over i_iter in main_loop in ppo_gym.py, when evaluating the trained model, the code gets stuck.
After testing, I found that it is because in process 4 (one of the processes created here), the action cannot be computed, which is weird. I didn't change anything, and I don't think multiprocessing should affect computing with the network. Do you have any idea why this happens?

Why does GAIL get lower rewards the more it is trained?

Hi, thank you for the baseline code, it helps me a lot. But I have a small problem running it. I first sample data with the trained expert policy and then provide it to GAIL, but in the Ant-v2 and Hopper-v2 environments the rewards get lower and lower as training continues. My environment is mujoco-py 2.0.8 and mujoco200. I would be very grateful if you could take the time to look into the problem for me.
[screenshots of the declining reward curves]

Inconsistent action shape when running CartPole-v1

Gym env: CartPole-v1
Affected code
File: gail_gym.py

    """update discriminator"""
    for _ in range(1):
        expert_state_actions = torch.from_numpy(expert_traj).to(dtype).to(device)

        g_o = discrim_net(torch.cat([states, actions], 1))
        e_o = discrim_net(expert_state_actions)
        optimizer_discrim.zero_grad()
        discrim_loss = discrim_criterion(g_o, ones((states.shape[0], 1), device=device)) + \
            discrim_criterion(e_o, zeros((expert_traj.shape[0], 1), device=device))
        discrim_loss.backward()
        optimizer_discrim.step()

Error

 g_o = discrim_net(torch.cat([states, actions], 1))
 RuntimeError: invalid argument 0: Tensors must have same number of dimensions: got 2 and 1 at /opt/conda/conda-bld/pytorch-cpu_1544218188686/work/aten/src/TH/generic/THTensorMoreMath.cpp:1324

To reproduce this error:
python ./gail/save_expert_traj.py --model-path assets/learned_models/CartPole-v1_trpo.p --env-name CartPole-v1 --save-model-interval 100
python ./gail/gail_gym.py --env-name CartPole-v1 --expert-traj-path assets/expert_traj/CartPole-v1_expert_traj.p

This happened because CartPole-v1's action is discrete, hence:
state = [[0.1,0.2,0.3,0.4],[0.1,0.2,0.3,0.4]]
action = [1,0]
so states is 2-D while actions is 1-D, and torch.cat throws this error.
Fix suggestion

  """update discriminator"""
    for _ in range(1):
        expert_state_actions = torch.from_numpy(expert_traj).to(dtype).to(device)

        if len(actions.shape) == 1:
            g_o = discrim_net(torch.cat([states, actions.unsqueeze(1)], 1))
        else:
            g_o = discrim_net(torch.cat([states, actions], 1))
        e_o = discrim_net(expert_state_actions)
        optimizer_discrim.zero_grad()
        discrim_loss = discrim_criterion(g_o, ones((states.shape[0], 1), device=device)) + \
            discrim_criterion(e_o, zeros((expert_traj.shape[0], 1), device=device))
        discrim_loss.backward()
        optimizer_discrim.step()

save_expert_traj doesn't raise this error because np.hstack stacks the elements instead of concatenating them along an existing dimension:
expert_traj.append(np.hstack([state, action]))

Confusion about advantage computation

Hey!
I'm a bit confused about why, in the code that computes advantages, the previous advantage value is being set to the first env's advantage from the previous time step, i.e. advantages[i, 0]
(assuming that advantages have shape (time_steps x num_envs x 1)).

prev_value = values[i, 0]

Could you link the source for the equations for this whole function?
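The recursion in question matches Generalized Advantage Estimation (GAE, Schulman et al., arXiv:1506.02438); below is a sketch of the standard update, with illustrative names rather than the repository's exact code:

    # gamma: discount factor, tau: the GAE lambda, masks[i] == 0 at episode ends
    advantages = [0.0] * len(rewards)
    prev_value, prev_advantage = 0.0, 0.0
    for i in reversed(range(len(rewards))):
        delta = rewards[i] + gamma * prev_value * masks[i] - values[i]
        advantages[i] = delta + gamma * tau * prev_advantage * masks[i]
        prev_value, prev_advantage = values[i], advantages[i]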
Thanks!
Gunshi

A question about the PPO implementation

Hi Dr. Yuan,

I successfully applied D3QN to a robotics navigation task using images. However, when I use your PPO implementation on this task, it doesn't seem to work with hyper-parameters similar to D3QN's (learning rate, gamma, network structure, etc.).

Actually, your implementation is great and has shown amazing performance on multiple other tasks. And for this navigation task, the environment settings are the same as when using D3QN. Do you have any suggestions for this situation where PPO doesn't work on this specific task? Which hyper-parameters and factors should I pay attention to?

I'm so sorry for this inconvenience and thank you for your time!

Example for Continued PPO training after GAIL?

Thank you so much for sharing these PyTorch implementations with the community. I was curious whether you have an example of continuing PPO training from a model saved by the GAIL process?
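Not an official example, but a hypothetical sketch of one way to do it (the pickle path and the tuple layout are assumptions, not the repository's documented save format):

    import pickle

    # load the networks saved at the end of a GAIL run and reuse them to seed PPO training
    policy_net, value_net = pickle.load(open('assets/learned_models/Hopper-v2_gail.p', 'rb'))[:2]
    # ...then construct the PPO agent/optimizers around these nets and run the usual main_loop()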

CNN Policy

Can you please add an example of a CNN policy? All the code is oriented towards MLP policies.
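In the meantime, a minimal sketch of what a CNN policy head could look like (an illustration, not code shipped with this repository):

    import torch
    import torch.nn as nn

    class CNNPolicy(nn.Module):
        """Gaussian policy over image observations; conv trunk + linear mean head."""
        def __init__(self, in_channels, action_dim):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
                nn.Flatten(),
            )
            self.action_mean = nn.LazyLinear(action_dim)  # lazily sized to the flattened features
            self.action_log_std = nn.Parameter(torch.zeros(1, action_dim))

        def forward(self, x):
            features = self.conv(x)
            action_mean = self.action_mean(features)
            return action_mean, self.action_log_std.expand_as(action_mean)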

Doubt regarding the calculation of advantage

Hey, thanks for this great repository! I am a beginner in RL and I am trying to understand the practical implementation of TRPO.
What is the purpose of multiplying by the variable 'masks' while computing the advantage (in the estimate_advantage() function)?
And what range of values do 'masks' take?

Thanks!

question about weight init

Hi,
Your implementation is great and easy to read.
I just had one question though, from the line:

self.action_mean.weight.data.mul_(0.1)

Is there any particular reason why the weights are initialized like that (instead of the usual Gaussian/Xavier scheme) with that specific scale?
Thanks!
Gunshi

Failed on " HopperEnv' object has no attribute 'seed' "

Traceback (most recent call last):
  File "/home/zjt/exp/PyTorch-RL/examples/ppo_gym.py", line 70, in <module>
    env.seed(args.seed)
  File "/home/zjt/anaconda3/envs/mujoco/lib/python3.9/site-packages/gym/core.py", line 241, in __getattr__
    return getattr(self.env, name)
  File "/home/zjt/anaconda3/envs/mujoco/lib/python3.9/site-packages/gym/core.py", line 241, in __getattr__
    return getattr(self.env, name)
  File "/home/zjt/anaconda3/envs/mujoco/lib/python3.9/site-packages/gym/core.py", line 241, in __getattr__
    return getattr(self.env, name)
AttributeError: 'HopperEnv' object has no attribute 'seed'

When "python examples/ppo_gym.py --env-name Hopper-v2", the program encountered an error with env.seed(args.seed), and even after commenting out this line, it ran into another error related to 'seed': AttributeError: 'numpy.random._generator.Generator' object has no attribute 'seed'.

TRPO: Is fixed_log_probs the same as log_probs?

In TRPO, is fixed_log_probs the same as log_probs? During debugging, the outputs of the two are identical, so is there no difference between pnew and pold?

    with torch.no_grad():
        fixed_log_probs = policy_net.get_log_prob(states, actions)
    """define the loss function for TRPO"""
    def get_loss(volatile=False):
        with torch.set_grad_enabled(not volatile):
            log_probs = policy_net.get_log_prob(states, actions)
            action_loss = -advantages * torch.exp(log_probs - fixed_log_probs)
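They are identical only at the point where fixed_log_probs is computed; here is a simplified sketch of why they diverge during the update (set_flat_params_to, prev_params, full_step, and loss_before are illustrative names, not the repository's exact line search):

    # fixed_log_probs is frozen under the old parameters, while get_loss() is re-evaluated at
    # trial parameters during the TRPO backtracking line search, so the ratio moves away from 1.
    for step_frac in [0.5 ** k for k in range(10)]:
        set_flat_params_to(policy_net, prev_params + step_frac * full_step)
        if get_loss().item() < loss_before:    # a real line search also checks the KL constraint
            break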

Training a recurrent policy

I am still struggling with the implementation of a recurrent policy. The trick from #1 worked and I can now run my RNN GAIL network. But no matter what I try, the mean reward actually decreases over time.

I am currently using the same ValueNet and Advantage Estimation as in the repository.

Do I have to change something in trpo_step in order to make RNN Policies work?

Thank you so much!

About computing Hessian*vector

Excuse me, in the TRPO code, in the definition of def Fvp_direct(v), what is the input v and how is it obtained? Thanks for your help!
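For context, a sketch of the standard direct Fisher-vector product and where v comes from (hedged, not copied from the repository; policy_net, states, and damping are assumed names): the TRPO natural-gradient step solves F x = g with conjugate gradients, and each CG iteration calls Fvp_direct with its current search direction as v.

    import torch

    def Fvp_direct(v):
        # Hessian-vector product of the mean KL, i.e. the Fisher matrix times v, plus damping
        kl = policy_net.get_kl(states).mean()
        grads = torch.autograd.grad(kl, policy_net.parameters(), create_graph=True)
        flat_grad_kl = torch.cat([grad.view(-1) for grad in grads])
        kl_v = (flat_grad_kl * v).sum()
        grads = torch.autograd.grad(kl_v, policy_net.parameters())
        return torch.cat([grad.contiguous().view(-1) for grad in grads]) + damping * v

    # conjugate_gradients(Fvp_direct, loss_grad, nsteps=10) then supplies v internally at each iteration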

Few Runtime errors

I received runtime errors (invalid value in reduce).
I think it's better to use BCEWithLogitsLoss as the loss criterion for the discriminator; it's numerically stable.
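A sketch of the suggested change (assuming the discriminator currently ends in a sigmoid and the training script uses nn.BCELoss()):

    import torch.nn as nn

    discrim_criterion = nn.BCEWithLogitsLoss()   # numerically stable sigmoid + BCE in one op
    # ...and drop the final torch.sigmoid() from the discriminator's forward pass,
    # since BCEWithLogitsLoss applies the sigmoid internally.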

about the kl

def get_kl(self, x):
    action_prob1 = self.forward(x)
    action_prob0 = action_prob1.detach()
    kl = action_prob0 * (torch.log(action_prob0) - torch.log(action_prob1))
    return kl.sum(1, keepdim=True)

Shouldn't the KL be between two different policies? Here action_prob1 == action_prob0? Thank you

Failure to train GAIL in the Ant-v2 environment

I trained your ppo first.

python examples/ppo_gym.py --env-name Ant-v2 --save-model-interval 100

After 500 episodes, I made trajectories.

python gail/save_expert_traj.py --model-path assets/learned_models/Ant-v2_ppo.p

Last, I ran gail.

python gail/gail_gym.py --env-name Ant-v2 --expert-traj-path assets/expert_traj/Ant-v2_expert_traj.p

I implemented GAIL and VAIL, but failed to train them too (Hopper worked well, though).

Any Ideas?

question about A2C

Did you try training an A2C agent on the Swimmer environment? I was not able to train it. I tested many NN parameters but was unsuccessful.

How are we using rewards in imitation learning?

Hi, these implementations are amazing, thank you for sharing them. I have a question about how, and rather why, we are using rewards in imitation learning.

rewards = torch.from_numpy(np.stack(batch.reward)).to(dtype).to(device)

In the paper they mention that instead of using the rewards to improve the policy, we use the log of the discriminator value (last line before the end of the for loop):
[screenshots of the GAIL algorithm's policy-update step from the paper]

As you can see above, the policy update uses the log of the discriminator. Could you please explain why this term is used instead of the reward?
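For context, GAIL implementations typically replace the environment reward with a learned reward from the discriminator before the policy-gradient update; a hedged sketch of a common choice (not necessarily this repository's exact expression):

    import torch

    with torch.no_grad():
        # surrogate reward from the discriminator; the environment reward from batch.reward
        # is then only used for logging, not for the policy update
        custom_rewards = -torch.log(discrim_net(torch.cat([states, actions], 1)))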
