Comments (16)
@rnederstigt I know what is going on for the entropy at least.
Because we allow any batch size, we do some masking (to not back-propagate through padded states) and therefore the entropy loss is biased towards zero:
entropy_loss = -th.mean(entropy * rollout_data.mask)
(I'm not 100% sure this is the way to compute it though :/)
if you remove the masking, you get the expected value of -1.4:
entropy_loss = -th.mean(entropy)
For the same reason (masking), the approx kl might be off, I would highly appreciate if someone could double check that ;)
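A minimal sketch of why multiplying by the mask and then averaging biases the result towards zero. NumPy arrays stand in for the torch tensors here, and the mask values are made up for illustration; this is not the library's code.

```python
import numpy as np

# Entropy of a uniform policy over 4 actions, one value per timestep.
entropy = np.full(8, np.log(4.0))
# Hypothetical mask: 1.0 for real timesteps, 0.0 for padded ones.
mask = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=np.float64)

# Plain mean: the padded zeros still count in the denominator.
biased_loss = -np.mean(entropy * mask)
# Dividing by the number of valid entries removes the bias.
masked_loss = -np.sum(entropy * mask) / np.sum(mask)

print(f"biased: {biased_loss:.4f}  masked: {masked_loss:.4f}")
```

With half of the entries padded, the plain mean is half of the masked one, so the size of the bias depends on how much padding each batch happens to contain.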
from stable-baselines3-contrib.
For the same reason (masking), the approx kl might be off, I would highly appreciate if someone could double check that ;)
I meant double check the logic. The implementation actually works (cf. link to the benchmark in the doc).
Applying a correct masked mean and standard deviation on all relevant quantities resolves the issue.
In favour of readability, I used a masked mean of the form,
# Convert mask from float to boolean
mask = rollout_data.mask > 1e-8
entropy_loss = - th.mean(entropy[mask])
rather than,
# Calculate the correct masked mean norm
mask_norm = 1 / th.sum(rollout_data.mask)
# Padded entries must be zeroed out before summing
entropy_loss = -th.sum(entropy * rollout_data.mask) * mask_norm
which might incur a performance hit.
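As a hedged illustration of "masked mean and standard deviation", the sketch below normalises advantages over valid entries only and checks that the boolean-index form and the sum/sum form give the same mean. It uses NumPy instead of torch; the helper names and sample values are hypothetical.

```python
import numpy as np

def masked_mean(values: np.ndarray, mask: np.ndarray) -> float:
    """Mean over the entries where the boolean mask is True."""
    return float(values[mask].mean())

def masked_std(values: np.ndarray, mask: np.ndarray) -> float:
    """Population std over valid entries (NumPy default, ddof=0).
    Note: torch.std defaults to the unbiased estimator instead."""
    return float(values[mask].std())

advantages = np.array([1.0, 2.0, 3.0, 0.0, 0.0])
float_mask = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
mask = float_mask > 1e-8  # float mask -> boolean mask

mean_indexed = masked_mean(advantages, mask)
mean_summed = float(np.sum(advantages * float_mask) / np.sum(float_mask))

# Normalise only the valid entries; padded ones are dropped entirely.
normalized = (advantages[mask] - mean_indexed) / (masked_std(advantages, mask) + 1e-8)
print(mean_indexed, mean_summed, normalized)
```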
Interesting insights! Though, my immediate knee-jerk reaction is "did you run for multiple runs per setup"? This could very well be results from random initialization.
But yes, there are no guarantees that the random initialization of the network will lead to totally random actions (or, maybe there are some studies into this, but as far as I am aware, it is nowhere guaranteed). Beyond that the blame is on the recurrent network architecture used in RecurrentPPO. I have not seen others doing similar studies as you did now (although it might be studied in some paper or be noted there). While it is not expected behaviour, I am not surprised the different architecture changes the behaviour so much :)
But please, try running both setups multiple times, and compare the average results of the two setups for a more solid insight.
I just came to the repo to ask the exact same question.
I can see the same behavior in my custom environment. Default PPO works great with the same hyperparameters where RecurrentPPO fails (I only changed the batch size and number of steps because RecurrentPPO consumes much more RAM). The actions chosen are much less varied in the latter (MultiDiscrete action scheme).
My last step to 'match' the previous set up is defining a custom LSTM policy architecture (custom feature extractor is the same for both). I will share it here in case it changes this strange behavior, which definitely is not working out on my env.
@Miffyli That's fair. I came across this issue while modifying RecurrentPPO to allow for PopArt normalisation and reward transformations in order to solve a custom environment. Coming from PPO2 from Stable Baselines, the magnitude difference of the approx_kl in RecurrentPPO struck me immediately when I first tried to solve the custom environment.
After that I did a couple of runs of the LunarLanderNoVel-v2 environment with the hyperparameters taken from the RL Zoo, to see if my messing around in the code had introduced bugs. But all runs were consistent with the results stated on https://wandb.ai/sb3/no-vel-envs/reports/PPO-vs-RecurrentPPO-aka-PPO-LSTM-on-environments-with-masked-velocity--VmlldzoxOTI4NjE4. And all had the apparent magnitude discrepancy of the approx_kl and entropy_loss (the examples provided are representative of these tests).
But to make a better apples-to-apples comparison I'll do multiple runs comparing PPO2 with RecurrentPPO, that should at least rule out the network architecture difference.
One thing that immediately pops out is that PPO2 asserts that "For recurrent policies the number of environments run in parallel should be a multiple of nminibatches." when trying to use n_env = 32, n_steps = 512, batch_size = 128 (where batch size is given as batch_size = n_env * n_steps / nminibatches) as given by RL Zoo.
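To make the arithmetic behind that assertion concrete, here is a small sketch. It assumes the check is `n_envs % nminibatches == 0`, as the quoted message suggests; the helper names are hypothetical.

```python
def nminibatches_for(n_envs: int, n_steps: int, batch_size: int) -> int:
    """Invert batch_size = n_envs * n_steps / nminibatches."""
    total = n_envs * n_steps
    if total % batch_size != 0:
        raise ValueError("batch_size must divide n_envs * n_steps")
    return total // batch_size

def satisfies_recurrent_constraint(n_envs: int, nminibatches: int) -> bool:
    """Assumed form of the PPO2 check for recurrent policies."""
    return n_envs % nminibatches == 0

# RL Zoo settings quoted above: 32 envs, 512 steps, batch size 128.
zoo_minibatches = nminibatches_for(32, 512, 128)
print(zoo_minibatches, satisfies_recurrent_constraint(32, zoo_minibatches))

# With nminibatches = 8 the constraint holds; the batch size is then 2048.
print(32 * 512 // 8, satisfies_recurrent_constraint(32, 8))
```

Under that assumption, a batch size of 128 implies 128 minibatches, which 32 environments cannot be a multiple of, whereas 8 minibatches (batch size 2048) passes.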
I performed multiple (four) runs to compare PPO2 with RecurrentPPO. The results show that all three metrics, approx_kl, clip_fraction and entropy, differ wildly in shape and magnitude.
And as I was taught to always show my work :)
class NoVel_v2(gym.ObservationWrapper):
    def __init__(self, env: gym.Env):
        gym.ObservationWrapper.__init__(self, env)
        self.env.observation_space = gym.spaces.Box(-np.inf, +np.inf, shape = (6, ), dtype = np.float32)

    def observation(self, obs: np.ndarray) -> np.ndarray:
        return obs[[True, True, False, False, True, True, True, True]]
*As I am looking through the documentation of LunarLander, I realise I did not mask the angular speed during these tests, i.e. obs[5] should also be masked in addition to obs[2] and obs[3]. But that shouldn't change these results.
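For reference, a sketch of the mask with the angular velocity also hidden. It assumes the LunarLander observation layout is (x, y, vx, vy, angle, angular velocity, left leg contact, right leg contact); the names `KEEP` and `mask_velocities` are hypothetical.

```python
import numpy as np

# True = keep the entry, False = hide it (vx, vy and angular velocity).
KEEP = np.array([True, True, False, False, True, False, True, True])

def mask_velocities(obs: np.ndarray) -> np.ndarray:
    """Drop the velocity entries from a single 8-dimensional observation."""
    return obs[KEEP]

obs = np.arange(8, dtype=np.float32)  # dummy observation: 0, 1, ..., 7
masked = mask_velocities(obs)
print(masked, masked.shape)
```

With this mask the wrapped observation space would have shape (5,) rather than (6,).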
PPO2 (SB)
from stable_baselines.common.tf_layers import ortho_init
vec_env = DummyVecEnv([lambda: Monitor(NoVel_v2(gym.make("LunarLander-v2")), filename = None) for _ in range(32)])
vec_env = VecNormalize(vec_env, norm_reward = True, gamma = .999)
policy_kwargs = dict(
    act_fun = tf.nn.relu,
    net_arch = ['lstm', dict(pi = [64], vf = [64])],
    n_lstm = 64)
model = PPO2('MlpLstmPolicy', vec_env, policy_kwargs = policy_kwargs,
    n_steps = 512, nminibatches = 8, noptepochs = 4,
    ent_coef = 0.01, vf_coef = .5, gamma = 0.999, lam = .98,
    verbose = 1, tensorboard_log = PATH + '/tensorboard')
## Ortho init LSTM
params = model.get_parameters()
wx = params['model/lstm1/wx:0']
wh = params['model/lstm1/wh:0']
b = params['model/lstm1/b:0']
h = wx.shape[1] // 4
for i in range(4):
    wx[:, i * h : (i + 1) * h] = ortho_init(1.0)(shape = (wx.shape[0], h))
    wh[:, i * h : (i + 1) * h] = ortho_init(1.0)(shape = (wh.shape[0], h))
b[h : 2 * h] = 1.
model.load_parameters(params)
Note that I opted to also orthogonally initialise the values of the LSTM module.
RecurrentPPO (SB3)
vec_env = DummyVecEnv([lambda: Monitor(NoVel_v2(gym.make("LunarLander-v2")), filename = None) for _ in range(32)])
vec_env = VecNormalize(vec_env, norm_reward = True, gamma = .999)
policy_kwargs = dict(
    net_arch = [dict(pi = [64], vf = [64])],
    activation_fn = th.nn.ReLU,
    lstm_hidden_size = 64,
    shared_lstm = True,
    enable_critic_lstm = False,
)
model = RecurrentPPO('MlpLstmPolicy', vec_env, policy_kwargs = policy_kwargs,
    n_steps = 512, batch_size = 2048, n_epochs = 4,
    ent_coef = 0.01, vf_coef = .5, gamma = 0.999, gae_lambda = .98,
    verbose = 1, tensorboard_log = PATH + '/tensorboard')
## Ortho init LSTM
lstm = model.policy.lstm_actor
wx = lstm.weight_ih_l0
wh = lstm.weight_hh_l0
bx = lstm.bias_ih_l0
bh = lstm.bias_hh_l0
h = wx.shape[0] // 4
for i in range(4):
    nn.init.orthogonal_(wx[i * h : (i + 1) * h])
    nn.init.orthogonal_(wh[i * h : (i + 1) * h])
nn.init.zeros_(bx)
nn.init.constant_(bx[h : h * 2], .5)
nn.init.zeros_(bh)
nn.init.constant_(bh[h : h * 2], .5)
TensorBoard results
The top row corresponds to PPO2 while the bottom row corresponds to RecurrentPPO. It is useful to mention that Stable Baselines reports the normalised reward values, so we should not focus on the difference in magnitude there (only on the sign). Looking at the crossover point this might indicate that PPO2 learns faster, but this should be taken with a grain of salt.
Hello,
I noticed rather high values for approx_kl and clip_fraction.
yes, I observed that too, but RecurrentPPO was able to solve the different envs.
(btw, the mask wrapper is defined here: https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/utils/wrappers.py#L344)
Which happens with PPO, but the RecurrentPPO run it starts off lower, which seems odd to me.
that's weird indeed, probably initialization is not right.
One thing that immediately pops out is that PPO2 asserts that "For recurrent policies the number of environments run in parallel should be a multiple of nminibatches."
yes, this was an annoying limitation that I removed by doing some masking magic. The code should be fine but if there is a bug that's probably there.
The top row corresponds to PPO2 while the bottom row corresponds to RecurrentPPO
Thanks for running that small study =)
To me, it looks like the two are similar (same magnitude and same dynamic for the different values), no?
and the entropy value is now around 1.4 for RecurrentPPO too?
nevermind, didn't look at the right graph
I still find it highly suspicious that the entropy of RecurrentPPO seems to follow the same dynamics as approx_kl and clip_value, i.e. it bends back down (getting increasingly random) at 300k steps, while the entropy of PPO2 continues to decline after that point.
I observed this to a greater extent in a custom environment, but haven't been able to replicate it with standard environments.
yes, there seems to be an issue
I've been investigating that a bit, and at initialization, nothing is suspicious for RecurrentPPO: the sampled actions are uniformly distributed as long as you don't take sequences into account...
==== Uniform actions in randomly sample obs ====
-log(1/4) = 1.3862943611198906
Entropy RecurrentPPO = 1.3862944841384888
Entropy PPO = 1.3851330280303955
Log prob RecurrentPPO = -1.386294960975647
Log prob PPO = -1.3884332180023193
==== Sample actions in randomly sample obs ====
Log prob RecurrentPPO = -1.3862946033477783
Log prob PPO = -1.3862944841384888
==== Sampled actions stats ===
Should be close to 1/num_actions = 1/4 = 0.25
uniform
2 0.260667
1 0.249000
0 0.245667
3 0.244667
Name: uniform, dtype: float64
ppo
3 0.269667
1 0.250667
0 0.241000
2 0.238667
Name: ppo, dtype: float64
RecurrentPPO
1 0.258667
2 0.249333
0 0.248333
3 0.243667
Name: RecurrentPPO, dtype: float64
import numpy as np
import torch as th
from sb3_contrib import RecurrentPPO
from sb3_contrib.common.recurrent.type_aliases import RNNStates
from stable_baselines3 import PPO
n_random_actions = 3000
recurrent_model = RecurrentPPO(
    "MlpLstmPolicy",
    "LunarLander-v2",
    verbose=0,
    # to have a big enough rollout buffer
    # in order to create dummy lstm states
    n_steps=n_random_actions,
)
ppo_model = PPO(
    "MlpPolicy",
    "LunarLander-v2",
    verbose=0,
    policy_kwargs=dict(ortho_init=False),
)
env = recurrent_model.get_env()
num_actions = env.action_space.n
# Taking uniform actions in uniformly sampled observations
random_obs = np.array([env.observation_space.sample() for _ in range(n_random_actions)])
single_obs = np.array([random_obs[0] for _ in range(n_random_actions)])
random_actions = np.array([env.action_space.sample() for _ in range(n_random_actions)])
observations, _ = recurrent_model.policy.obs_to_tensor(random_obs)
actions = th.tensor(random_actions, device=observations.device).float()
# Recurrent PPO
# we assume a new episode for each observation
# therefore we don't use lstm states
episode_starts = th.ones_like(actions)
n_layers = 1
n_seq = len(observations)
lstm_states_pi = (
    # (n_steps, n_layers, n_envs, dim) -> (n_layers, n_seq, dim)
    recurrent_model.rollout_buffer.hidden_states_pi[:n_seq].reshape(
        n_layers, n_seq, -1
    ),
    # second element of the LSTM state tuple is the cell state
    recurrent_model.rollout_buffer.cell_states_pi[:n_seq].reshape(
        n_layers, n_seq, -1
    ),
)
distribution, new_lstm_states = recurrent_model.policy.get_distribution(
    observations,
    lstm_states=lstm_states_pi,
    episode_starts=episode_starts,
)
log_prob_recurrent = distribution.log_prob(actions)
entropy_recurrent = distribution.entropy()
# PPO
distribution = ppo_model.policy.get_distribution(observations)
log_prob_ppo = distribution.log_prob(actions)
entropy_ppo = distribution.entropy()
print("==== Uniform actions in randomly sample obs ====")
print(f"{'-log(1/4) ':20} = {-np.log(1/4)}")
print(f"{'Entropy RecurrentPPO':20} = {entropy_recurrent.mean().item()}")
print(f"{'Entropy PPO':20} = {entropy_ppo.mean().item()}")
print(f"{'Log prob RecurrentPPO':20} = {log_prob_recurrent.mean().item()}")
print(f"{'Log prob PPO':20} = {log_prob_ppo.mean().item()}")
# Taking actions in uniformly sampled observations
# Recurrent PPO
# we assume a new episode for each observation
# therefore we don't use lstm states
random_actions_recurrent = recurrent_model.predict(observations, deterministic=False)[0]
actions_recurrent = th.tensor(
    random_actions_recurrent, device=observations.device
).float()
distribution, new_lstm_states = recurrent_model.policy.get_distribution(
    observations,
    lstm_states=lstm_states_pi,
    episode_starts=episode_starts,
)
log_prob_recurrent = distribution.log_prob(actions_recurrent)
# PPO
random_actions_ppo = ppo_model.predict(observations, deterministic=False)[0]
actions_ppo = th.tensor(random_actions_ppo, device=observations.device).float()
log_prob_ppo = distribution.log_prob(actions_ppo)
print("==== Sample actions in randomly sample obs ====")
print(f"{'Log prob RecurrentPPO':20} = {log_prob_recurrent.mean().item()}")
print(f"{'Log prob PPO':20} = {log_prob_ppo.mean().item()}")
# Stats about action sampling
import pandas as pd
df = pd.DataFrame(
    {
        "uniform": random_actions,
        "ppo": random_actions_ppo,
        "RecurrentPPO": random_actions_recurrent,
    }
)
print("==== Sampled actions stats ===")
print("Should be close to 1/num_actions = 1/4 = 0.25")
for name in df.keys():
    print(name)
    print(df[name].value_counts() / n_random_actions)
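The reference value -log(1/4) used in this check can be reproduced without any model. The sketch below computes the entropy of a categorical distribution from logits in plain NumPy; `categorical_entropy` is a hypothetical helper, not part of either library.

```python
import numpy as np

def categorical_entropy(logits: np.ndarray) -> float:
    """Entropy of softmax(logits), computed in a numerically stable way."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    probs = np.exp(log_probs)
    return float(-np.sum(probs * log_probs))

# Equal logits give a uniform policy over the 4 LunarLander actions.
uniform_entropy = categorical_entropy(np.zeros(4))
# Any preference for one action lowers the entropy.
peaked_entropy = categorical_entropy(np.array([5.0, 0.0, 0.0, 0.0]))
print(uniform_entropy, peaked_entropy)
```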
Let me have it double checked on my custom env;)
@araffin That's great to hear!
Off the top of my head, a correct masked entropy loss could be of the form,
entropy_loss = -th.sum(entropy * rollout_data.mask) / th.sum(rollout_data.mask)
The same holds for calculating the value loss.
Off the top of my head, a correct masked entropy loss could be of the form,
@rnederstigt looks good to me (at first, I thought a + eps was missing, but that would mean we masked everything), I would appreciate a PR that updates the loss ;) (and probably other quantities that are wrongly computed because of masking).
I'm also glad that the fix is simple =)
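As a rough sketch of what "other quantities" could look like, the same masked reduction is applied below to a squared-error value loss and to the approximate KL estimate mean((exp(log_ratio) - 1) - log_ratio). NumPy stands in for torch, and the function name and sample values are hypothetical.

```python
import numpy as np

def masked_reduce(values: np.ndarray, mask: np.ndarray) -> float:
    """Sum over valid entries divided by the number of valid entries."""
    return float(np.sum(values * mask) / np.sum(mask))

mask = np.array([1.0, 1.0, 0.0, 0.0])  # last two steps are padding

# Value loss: squared error between returns and value predictions.
returns = np.array([1.0, 2.0, 0.0, 0.0])
values = np.array([0.0, 0.0, 0.0, 0.0])
squared_error = (returns - values) ** 2
value_loss = masked_reduce(squared_error, mask)
naive_value_loss = float(np.mean(squared_error * mask))

# Approximate KL from the log-ratio of new and old action probabilities.
log_ratio = np.array([0.1, -0.1, 0.0, 0.0])
kl_terms = (np.exp(log_ratio) - 1.0) - log_ratio
approx_kl = masked_reduce(kl_terms, mask)
naive_approx_kl = float(np.mean(kl_terms * mask))

print(value_loss, naive_value_loss, approx_kl, naive_approx_kl)
```

In this toy batch half of the entries are padding, so both naive means come out at half of their masked counterparts.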
@araffin I have a bit of trouble following the padding logic, but I assume the sequences are just padded with zeros?
I have a bit of trouble following the padding logic, but I assume the sequences are just padded with zeros?
yes, that's the high level idea (and the result of the hacky code). The padding code is a bit tricky because we have environments in parallel and different reshape that must keep the sequence order (I tried to add as many comments to ease the understanding of that part).
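The high-level idea of zero padding plus a validity mask can be sketched as follows. This is a simplified illustration for one-dimensional sequences, not the sb3_contrib implementation, which also has to handle parallel environments and reshaping; `pad_sequences` is a hypothetical helper.

```python
import numpy as np

def pad_sequences(sequences, pad_value=0.0):
    """Right-pad variable-length sequences and return (padded, mask)."""
    max_len = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_len), pad_value, dtype=np.float64)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for i, seq in enumerate(sequences):
        padded[i, : len(seq)] = seq   # copy the real steps
        mask[i, : len(seq)] = True    # mark them as valid
    return padded, mask

padded, mask = pad_sequences([[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]])
print(padded)
print(mask)
# Boolean indexing recovers exactly the original, unpadded values in order.
print(padded[mask])
```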