Comments (12)
This is a wild guess, but since you mentioned grad_kl: it seems like this code uses a single-sample estimate of the KL (and then averages over it), which is known to sometimes return negative values (see the lengthy discussion and update on SB3 here). This is simply based on the "something is negative but shouldn't be" part :D
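(For illustration only, a toy PyTorch snippet showing why the single-sample estimate can go negative while the analytical KL cannot; the distributions and numbers below are made up, not taken from the code in question.)

```python
import torch as th
from torch.distributions import Normal, kl_divergence

old_dist = Normal(th.tensor(0.0), th.tensor(1.0))
new_dist = Normal(th.tensor(0.1), th.tensor(1.0))

# Single-sample (Monte-Carlo) estimate of KL(old || new): with a small
# batch of actions sampled from old_dist, this can come out negative.
actions = old_dist.sample((8,))
kl_sample = (old_dist.log_prob(actions) - new_dist.log_prob(actions)).mean()

# Analytical KL between the two distributions: always >= 0.
kl_exact = kl_divergence(old_dist, new_dist)
print(kl_sample.item(), kl_exact.item())
```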
The only other tip I can give is to look at other implementations of TRPO and see what they did, e.g. Spinning Up (alas, they too only have a TF1 version of TRPO).
from stable-baselines3-contrib.
Hi,
I've added an assert slightly earlier, inside the conjugate gradient algorithm. But it points in the same direction: the matrix defined in Hpv is not positive-definite.
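(For illustration only, a minimal CG sketch with that kind of positive-definiteness check; the names are hypothetical and this is not the actual code of the PR.)

```python
import torch as th

def conjugate_gradient(mvp, b, max_iter=10, residual_tol=1e-10):
    """Solve A x = b where A is only available through the matrix-vector
    product `mvp` (e.g. a Fisher/Hessian-vector product)."""
    x = th.zeros_like(b)
    r = b.clone()
    p = b.clone()
    rs_old = r.dot(r)
    for _ in range(max_iter):
        Ap = mvp(p)
        pAp = p.dot(Ap)
        # CG assumes A is positive-definite; a non-positive curvature
        # value here is exactly the failure mode described above.
        assert pAp.item() > 0, "matrix-vector product is not positive-definite"
        alpha = rs_old / pAp
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r.dot(r)
        if rs_new < residual_tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```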
@Miffyli Thanks for the approximation trick - neat one. I'll have a look at it (and its gradients :) ). Other implementations usually use a distribution object (custom or from one of the major frameworks) that computes the KL directly. I also wanted to do that but wasn't sure where I could find a distribution object for the policy being passed - let me have a better look at it.
Thanks,
Cyprien
from stable-baselines3-contrib.
probably a better idea would be to create a new method get_distribution()
Done
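(Rough sketch of what such a helper could look like on ActorCriticPolicy, assuming the existing extract_features / mlp_extractor / _get_action_dist_from_latent internals; the actual method added in the PR may differ.)

```python
# Hypothetical sketch; exact signatures/internals in SB3 may differ.
def get_distribution(self, obs: th.Tensor) -> Distribution:
    features = self.extract_features(obs)
    latent_pi, _ = self.mlp_extractor(features)
    # Return the SB3 Distribution wrapper built from the current
    # policy parameters instead of sampled actions / log-probs.
    return self._get_action_dist_from_latent(latent_pi)
```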
a deepcopy should probably solve that issue, no?
Used a shallow copy; but I am wondering whether it makes more sense to avoid any kind of copy and do the necessary refactoring work to avoid the side-effect. Probably something for the future.
Using the PyTorch distribution did the trick. I also refined a few things to avoid numerical instabilities stemming from the CG method.
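(One common refinement, shown only for illustration and not necessarily what was done here: damp the matrix-vector product before handing it to CG so the system stays well-conditioned.)

```python
def damped_mvp(fvp, v, cg_damping=0.1):
    # `fvp` is any callable computing the Fisher-vector product A @ v.
    # Solving (A + cg_damping * I) x = g instead of A x = g keeps the
    # system positive-definite and less sensitive to numerical noise.
    return fvp(v) + cg_damping * v
```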
How would you like to proceed @araffin ?
from stable-baselines3-contrib.
As mentioned in the contrib contributing guide, the next step is to match published results; I would start with the PyBullet envs (I had some results in the SB2 zoo).
Regarding the benchmark, once you have created a fork of the rl zoo (cf. guide), I can help you run it on a larger scale (I have access to a cluster).
from stable-baselines3-contrib.
@araffin What about ActorCriticPolicy.evaluate_actions returning the Distribution object directly instead of the entropy as the last output? This would give access to the Distribution.distribution object inside the training loop, so the analytical KL divergence could be computed instead of the sample estimate.
Also, the side-effect in Distribution.proba_distribution, called in ActorCriticPolicy._get_action_dist_from_latent, means it's not possible to compute the detached old distribution using ActorCriticPolicy.evaluate_actions in a no_grad block, because it overrides the parameters of the new distribution (in the image below, the parameters of distribution are replaced with the ones from old_distribution because of the side-effect).
On a side note, PyTorch currently doesn't offer an easy way to "detach" a distribution, but maybe it could be implemented in SB3's Distribution class.
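(For illustration, one way to get a "detached" copy is to rebuild the underlying torch distribution from detached parameters; the helper name below is hypothetical and not an existing SB3 or PyTorch API.)

```python
import torch as th
from torch.distributions import Normal

def detached_copy(dist: Normal) -> Normal:
    # Rebuild the distribution from detached, cloned parameters so the
    # "old" distribution is frozen: later side-effects on the policy's
    # distribution object or gradient tracking no longer affect it.
    return Normal(dist.loc.detach().clone(), dist.scale.detach().clone())
```

kl_divergence(detached_copy(old_dist), new_dist) would then only propagate gradients through the new distribution.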
Cyprien
from stable-baselines3-contrib.
I will try to have a deeper look at it soon. In the meantime, I recommend reading part of John Schulman's thesis, notably the "Computing the Fisher-Vector Product" section ;)
What about ActorCriticPolicy.evaluate_actions returning the Distribution object directly instead of the entropy as the last output?
probably a better idea would be to create a new method get_distribution()
in the image below, the parameters of distribution are replaced with the ones from old_distribution because of the side-effect
a deepcopy should probably solve that issue, no?
EDIT: you can also take a look at the Theano implementation and the Tianshou one
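(For reference, a minimal sketch of the double-backprop trick from that "Computing the Fisher-Vector Product" section; the names are hypothetical, not the PR's actual code.)

```python
import torch as th

def fisher_vector_product(mean_kl, params, vector):
    # Hessian-vector product of the mean KL w.r.t. the policy parameters:
    # H v = d/dtheta [ (dKL/dtheta) . v ], computed with two backward
    # passes instead of ever forming the full Fisher/Hessian matrix.
    grads = th.autograd.grad(mean_kl, params, create_graph=True)
    flat_grad_kl = th.cat([g.reshape(-1) for g in grads])
    grad_dot_v = (flat_grad_kl * vector).sum()
    hvp = th.autograd.grad(grad_dot_v, params, retain_graph=True)
    return th.cat([g.reshape(-1) for g in hvp])
```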
from stable-baselines3-contrib.
As mentioned in the contrib contributing guide, the next step is to match published results; I would start with the PyBullet envs (I had some results in the SB2 zoo)
and at the same time open a draft PR ;)
from stable-baselines3-contrib.
Hi,
Sorry for the delay (holidays). I've pushed to a fork of the rl zoo: https://github.com/cyprienc/rl-baselines3-zoo
I'll need some help running it on a larger scale since my local compute is not enough.
Let me know what I can do.
Thanks,
Cyprien
from stable-baselines3-contrib.
Could you also open a PR?
This will make it easier to review/use ;)
from stable-baselines3-contrib.
Sure: DLR-RM/rl-baselines3-zoo#163
I'll fill in the PR message later; I just opened the PR to avoid losing time.
Thanks,
Cyprien
from stable-baselines3-contrib.
I meant a PR to sb3-contrib...
from stable-baselines3-contrib.
Indeed... #40
from stable-baselines3-contrib.