Comments (12)
This is a wild guess, but since you mentioned grad_kl: it seems like this code uses a single-sample estimate of the KL (and then averages over it), which is known to sometimes return negative values (see the lengthy discussion and update on SB3 here). This is simply based on the "something is negative but shouldn't be" part :D
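(For illustration only, a toy PyTorch snippet showing why the single-sample estimate can go negative while the analytical KL cannot; the distributions and numbers below are made up, not taken from the code in question.)

```python
import torch as th
from torch.distributions import Normal, kl_divergence

old_dist = Normal(th.tensor(0.0), th.tensor(1.0))
new_dist = Normal(th.tensor(0.1), th.tensor(1.0))

# Single-sample (Monte-Carlo) estimate of KL(old || new): with a small
# batch of actions sampled from old_dist, this can come out negative.
actions = old_dist.sample((8,))
kl_sample = (old_dist.log_prob(actions) - new_dist.log_prob(actions)).mean()

# Analytical KL between the two distributions: always >= 0.
kl_exact = kl_divergence(old_dist, new_dist)
print(kl_sample.item(), kl_exact.item())
```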
The only other tip I can give is to look at other implementations of TRPO and see what they did, e.g. Spinning Up (alas, they too only have a TF1 version of TRPO).
from stable-baselines3-contrib.
Hi,
I've added an assert slightly earlier, inside the conjugate gradient algorithm. But it points in the same direction: the matrix defined in Hpv is not positive-definite.
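(For illustration only, a minimal CG sketch with that kind of positive-definiteness check; the names are hypothetical and this is not the actual code of the PR.)

```python
import torch as th

def conjugate_gradient(mvp, b, max_iter=10, residual_tol=1e-10):
    """Solve A x = b where A is only available through the matrix-vector
    product `mvp` (e.g. a Fisher/Hessian-vector product)."""
    x = th.zeros_like(b)
    r = b.clone()
    p = b.clone()
    rs_old = r.dot(r)
    for _ in range(max_iter):
        Ap = mvp(p)
        pAp = p.dot(Ap)
        # CG assumes A is positive-definite; a non-positive curvature
        # value here is exactly the failure mode described above.
        assert pAp.item() > 0, "matrix-vector product is not positive-definite"
        alpha = rs_old / pAp
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r.dot(r)
        if rs_new < residual_tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```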
@Miffyli Thanks for the approximation trick - neat one. I'll have a look at it (and its gradients :) ). Other implementations usually use a distribution object (custom or from one of the major frameworks) that computes the KL directly. I also wanted to do that but wasn't sure where I could find a distribution object for the policy being passed - let me have a better look at it.
Thanks,
Cyprien
from stable-baselines3-contrib.
probably a better idea would be to create a new method get_distribution()
Done
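(Rough sketch of what such a helper could look like on ActorCriticPolicy, assuming the existing extract_features / mlp_extractor / _get_action_dist_from_latent internals; the actual method added in the PR may differ.)

```python
# Hypothetical sketch; exact signatures/internals in SB3 may differ.
def get_distribution(self, obs: th.Tensor) -> Distribution:
    features = self.extract_features(obs)
    latent_pi, _ = self.mlp_extractor(features)
    # Return the SB3 Distribution wrapper built from the current
    # policy parameters instead of sampled actions / log-probs.
    return self._get_action_dist_from_latent(latent_pi)
```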
a deepcopy should probably solve that issue, no?
Used a shallow copy; but I am wondering whether it makes more sense to avoid any kind of copy and do the necessary refactoring work to avoid the side-effect. Probably something for the future.
Using the PyTorch distribution did the trick. I also refined a few things to avoid numerical instabilities stemming from the CG method.
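(One common refinement, shown only for illustration and not necessarily what was done here: damp the matrix-vector product before handing it to CG so the system stays well-conditioned.)

```python
def damped_mvp(fvp, v, cg_damping=0.1):
    # `fvp` is any callable computing the Fisher-vector product A @ v.
    # Solving (A + cg_damping * I) x = g instead of A x = g keeps the
    # system positive-definite and less sensitive to numerical noise.
    return fvp(v) + cg_damping * v
```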
How would you like to proceed @araffin ?
from stable-baselines3-contrib.
As mentioned in the contrib contributing guide, the next step is to match published results; I would start with the PyBullet envs (I had some results in the SB2 zoo).
Regarding the benchmark, once you have created a fork of the rl zoo (cf. guide), I can help you run it on a larger scale (I have access to a cluster).
from stable-baselines3-contrib.
@araffin What about ActorCriticPolicy.evaluate_actions returning the Distribution object directly instead of the entropy as the last output? This would give access to the Distribution.distribution object inside the training loop, so the analytical KL divergence could be computed instead of the sample estimate.
Also, the side-effect in Distribution.proba_distribution, called in ActorCriticPolicy._get_action_dist_from_latent, means it's not possible to compute the detached old distribution using ActorCriticPolicy.evaluate_actions in a no_grad block, because it overrides the parameters of the new distribution (in the image below, the parameters of distribution are replaced with the ones from old_distribution because of the side-effect).
On a side note, PyTorch currently doesn't offer an easy way to "detach" a distribution, but maybe it could be implemented in SB3's Distribution class.
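(For illustration, one way to get a "detached" copy is to rebuild the underlying torch distribution from detached parameters; the helper name below is hypothetical and not an existing SB3 or PyTorch API.)

```python
import torch as th
from torch.distributions import Normal

def detached_copy(dist: Normal) -> Normal:
    # Rebuild the distribution from detached, cloned parameters so the
    # "old" distribution is frozen: later side-effects on the policy's
    # distribution object or gradient tracking no longer affect it.
    return Normal(dist.loc.detach().clone(), dist.scale.detach().clone())
```

kl_divergence(detached_copy(old_dist), new_dist) would then only propagate gradients through the new distribution.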
Cyprien
from stable-baselines3-contrib.
I will try to have a deeper look at it soon. In the meantime, I recommend reading part of John Schulman's thesis, notably the "Computing the Fisher-Vector Product" section ;)
What about ActorCriticPolicy.evaluate_actions returning the Distribution object directly instead of the entropy as the last output?
probably a better idea would be to create a new method get_distribution()
in the image below, the parameters of distribution are replaced with the ones from old_distribution because of the side-effect
a deepcopy should probably solve that issue, no?
EDIT: you can also take a look at the Theano implementation and the Tianshou one
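(For reference, a minimal sketch of the double-backprop trick from that "Computing the Fisher-Vector Product" section; the names are hypothetical, not the PR's actual code.)

```python
import torch as th

def fisher_vector_product(mean_kl, params, vector):
    # Hessian-vector product of the mean KL w.r.t. the policy parameters:
    # H v = d/dtheta [ (dKL/dtheta) . v ], computed with two backward
    # passes instead of ever forming the full Fisher/Hessian matrix.
    grads = th.autograd.grad(mean_kl, params, create_graph=True)
    flat_grad_kl = th.cat([g.reshape(-1) for g in grads])
    grad_dot_v = (flat_grad_kl * vector).sum()
    hvp = th.autograd.grad(grad_dot_v, params, retain_graph=True)
    return th.cat([g.reshape(-1) for g in hvp])
```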
from stable-baselines3-contrib.
As mentioned in the contrib contributing guide, the next step is to match published results; I would start with the PyBullet envs (I had some results in the SB2 zoo)
and at the same time open a draft PR ;)
from stable-baselines3-contrib.
Hi,
Sorry for the delay (holidays). I've pushed to a fork of the rl zoo: https://github.com/cyprienc/rl-baselines3-zoo
I'll need some help running it on a larger scale since my local compute is not enough.
Let me know what I can do.
Thanks,
Cyprien
from stable-baselines3-contrib.
Could you also open a PR?
This will make it easier to review/use ;)
from stable-baselines3-contrib.
Sure: DLR-RM/rl-baselines3-zoo#163
I'll fill in the PR message later; I just opened the PR to avoid losing time.
Thanks,
Cyprien
from stable-baselines3-contrib.
I meant a PR to sb3-contrib...
from stable-baselines3-contrib.
Indeed... #40
from stable-baselines3-contrib.