[question] Question regarding RecurrentPPO and the magnitude of approx_kl / clip_fraction / entropy_loss, about stable-baselines-team/stable-baselines3-contrib

Comments (16)

araffin commented on May 22, 2024

@rnederstigt I know what is going on for the entropy at least.

Because we allow any batch size, we do some masking (to not back-propagate through padded states) and therefore the entropy loss is biased towards zero:

entropy_loss = -th.mean(entropy * rollout_data.mask)

(I'm not 100% sure this is the way to compute it though :/)
If you remove the masking, you get the expected value of about -1.4:

entropy_loss = -th.mean(entropy)

For the same reason (masking), the approx kl might be off. I would highly appreciate it if someone could double-check that ;)
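
To make the bias concrete, here is a small sketch. NumPy stands in for the `th` calls above, and the toy `entropy` and `mask` arrays are made up for illustration rather than taken from a real rollout. With half of the batch padded, averaging `entropy * mask` over all entries roughly halves the reported value:

```python
import numpy as np

# Toy batch: 4 real steps followed by 4 padded steps (illustrative values only).
uniform_entropy = np.log(4.0)  # entropy of a uniform policy over 4 actions
entropy = np.full(8, uniform_entropy)
mask = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=np.float64)

# Mean over ALL entries: padded steps add zeros to the numerator
# but are still counted in the denominator.
biased_loss = -np.mean(entropy * mask)

# Mean over the valid entries only.
masked_loss = -np.sum(entropy * mask) / np.sum(mask)

print(f"biased: {biased_loss:.4f}")
print(f"masked: {masked_loss:.4f}")
```

The larger the padded share of a batch, the closer the first value gets to zero, which would be consistent with the low entropy magnitude reported for RecurrentPPO.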

from stable-baselines3-contrib.

araffin commented on May 22, 2024

For the same reason (masking), the approx kl might be off, I would highly appreciate if someone could double check that ;)

I meant double check the logic. The implementation actually works (cf. link to the benchmark in the doc).

rnederstigt commented on May 22, 2024

Applying a correct masked mean and standard deviation on all relevant quantities resolves the issue.

In favour of readability, I used a masked mean of the form,

# Convert mask from float to boolean
mask = rollout_data.mask > 1e-8
entropy_loss = - th.mean(entropy[mask])

rather than,

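# Calculate the normalisation factor for the masked mean
mask_norm = 1 / th.sum(rollout_data.mask)
# The mask must also be applied inside the sum, otherwise padded steps are counted
entropy_loss = -th.sum(entropy * rollout_data.mask) * mask_norm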

The boolean-indexing version might incur a performance hit.
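
As a quick sanity check, the sketch below compares the two forms on made-up data. NumPy is used in place of `th`, and the `entropy` and `mask` arrays are illustrative assumptions. The forms agree as long as the mask is applied inside the sum:

```python
import numpy as np

rng = np.random.default_rng(0)
entropy = rng.uniform(0.5, 1.5, size=10)       # made-up per-step entropies
mask = (np.arange(10) < 6).astype(np.float64)  # 6 valid steps, 4 padded steps

# Form 1: boolean indexing, then a plain mean.
loss_indexed = -np.mean(entropy[mask > 1e-8])

# Form 2: masked sum divided by the number of valid steps.
loss_summed = -np.sum(entropy * mask) / np.sum(mask)

# Leaving the mask out of the sum also counts the padded steps.
loss_unmasked_sum = -np.sum(entropy) / np.sum(mask)

print(loss_indexed, loss_summed, loss_unmasked_sum)
```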

Miffyli commented on May 22, 2024

Interesting insights! Though, my immediate knee-jerk reaction is "did you do multiple runs per setup"? This could very well be a result of random initialization.

But yes, there are no guarantees that the random initialization of the network will lead to totally random actions (or, maybe there are some studies into this, but as far as I am aware, it is nowhere guaranteed). Beyond that, the blame is on the recurrent network architecture used in RecurrentPPO. I have not seen others doing similar studies as you did now (although it might be studied in some paper or be noted there). While it is not expected behaviour, I am not surprised the different architecture changes the behaviour so much :)

But please, try running both setups multiple times, and compare the average results of the two setups for a more solid insight.

VVIERV00 commented on May 22, 2024

I just came to the repo to ask the exact same question.
I can see the same behavior in my custom environment. Default PPO works great with the same hyperparameters where RecurrentPPO fails (I only changed the batch size and the number of steps, because RecurrentPPO consumes much more RAM). The actions chosen are much less varied in the latter (MultiDiscrete action scheme).
My last step to 'match' the previous setup is defining a custom LSTM policy architecture (the custom feature extractor is the same for both). I will share it here in case it changes this strange behavior, which is definitely not working out on my env.

rnederstigt commented on May 22, 2024

@Miffyli That's fair. I came across this issue while modifying RecurrentPPO to allow for PopArt normalisation and reward transformations in order to solve a custom environment. Coming from PPO2 in Stable Baselines, the difference in magnitude of approx_kl in RecurrentPPO struck me immediately when I first tried to solve the custom environment.

After that I did a couple of runs of the LunarLanderNoVel-v2 environment with the hyperparameters taken from the RL Zoo, to see whether my messing around in the code had introduced bugs. But all runs were consistent with the results stated at https://wandb.ai/sb3/no-vel-envs/reports/PPO-vs-RecurrentPPO-aka-PPO-LSTM-on-environments-with-masked-velocity--VmlldzoxOTI4NjE4. And all had the apparent magnitude discrepancy in approx_kl and entropy_loss (the examples provided are representative of these tests).

But to make a better apples-to-apples comparison I'll do multiple runs comparing PPO2 with RecurrentPPO; that should at least rule out the network architecture difference.

One thing that immediately pops out is that PPO2 asserts that "For recurrent policies the number of environments run in parallel should be a multiple of nminibatches." when trying to use n_env = 32, n_steps = 512, batch_size = 128 (where batch size is given as batch_size = n_env * n_steps / nminibatches) as given by RL Zoo.
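
The arithmetic behind that assertion can be checked with a few lines. This is only a sketch: `nminibatches_for` is a hypothetical helper written for this comment, and `batch_size = 2048` is an extra value chosen here for contrast:

```python
def nminibatches_for(n_envs: int, n_steps: int, batch_size: int) -> int:
    """Invert batch_size = n_envs * n_steps / nminibatches (hypothetical helper)."""
    total = n_envs * n_steps
    if total % batch_size != 0:
        raise ValueError("batch_size must divide n_envs * n_steps")
    return total // batch_size


n_envs, n_steps = 32, 512
for batch_size in (128, 2048):
    n_mb = nminibatches_for(n_envs, n_steps, batch_size)
    # Condition quoted from the PPO2 assertion for recurrent policies
    ok = n_envs % n_mb == 0
    print(f"batch_size={batch_size}: nminibatches={n_mb}, n_envs is a multiple: {ok}")
```

With `batch_size = 128` the implied `nminibatches` is 128, which 32 environments cannot be a multiple of, whereas `batch_size = 2048` implies `nminibatches = 8` and satisfies the condition.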

rnederstigt commented on May 22, 2024

I performed multiple (four) runs to compare PPO2 with RecurrentPPO. The results show that all three metrics (approx_kl, clip_fraction and entropy) differ wildly in shape and magnitude.

And as I was taught to always show my work :)

import gym
import numpy as np

class NoVel_v2(gym.ObservationWrapper):
    def __init__(self, env: gym.Env):
        gym.ObservationWrapper.__init__(self, env)
        # Set the wrapper's own observation space (6 of the 8 entries are kept)
        self.observation_space = gym.spaces.Box(-np.inf, +np.inf, shape = (6, ), dtype = np.float32)
    def observation(self, obs: np.ndarray) -> np.ndarray:
        # Drop obs[2] and obs[3] (the linear velocities)
        return obs[[True, True, False, False, True, True, True, True]]

*Looking through the documentation of LunarLander, I realise I did not mask the angular speed during these tests, i.e. obs[5] should also be masked in addition to obs[2] and obs[3]. But that shouldn't change these results.
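
For reference, a mask that also drops the angular speed could look like the sketch below. It is not what was used in the runs reported here; the `keep` array and the dummy observation are illustrative assumptions, and the observation space of the wrapper would then need `shape = (5, )`:

```python
import numpy as np

# LunarLander observation layout:
# [x, y, vx, vy, angle, angular velocity, left leg contact, right leg contact]
keep = np.ones(8, dtype=bool)
keep[[2, 3, 5]] = False  # drop vx, vy and the angular velocity

obs = np.arange(8, dtype=np.float32)  # dummy observation: each value equals its index
masked_obs = obs[keep]
print(masked_obs, masked_obs.shape)
```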

PPO2 (SB)

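import tensorflow as tf
from stable_baselines import PPO2
from stable_baselines.bench import Monitor
from stable_baselines.common.tf_layers import ortho_init
from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize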

vec_env = DummyVecEnv([lambda: Monitor(NoVel_v2(gym.make("LunarLander-v2")), filename = None) for _ in range(32)])
vec_env = VecNormalize(vec_env, norm_reward = True, gamma = .999)

policy_kwargs = dict(
    act_fun = tf.nn.relu,
    net_arch = ['lstm', dict(pi = [64], vf = [64])],
    n_lstm = 64)

model = PPO2('MlpLstmPolicy', vec_env, policy_kwargs = policy_kwargs,
    n_steps = 512,  nminibatches = 8, noptepochs = 4,
    ent_coef = 0.01, vf_coef = .5, gamma = 0.999, lam = .98,
    verbose = 1, tensorboard_log = PATH + '/tensorboard')

## Ortho init LSTM
params = model.get_parameters()

wx = params['model/lstm1/wx:0']
wh = params['model/lstm1/wh:0']
b  = params['model/lstm1/b:0']

h = wx.shape[1] // 4
for i in range(4):
    wx[:, i * h : (i + 1) * h] = ortho_init(1.0)(shape = (wx.shape[0], h))
    wh[:, i * h : (i + 1) * h] = ortho_init(1.0)(shape = (wh.shape[0], h))

b[h : 2 * h] = 1.

model.load_parameters(params)

Note that I opted to also orthogonally initialise the values of the LSTM module.

RecurrentPPO (SB3)

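import torch as th
from torch import nn
from sb3_contrib import RecurrentPPO
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

vec_env = DummyVecEnv([lambda: Monitor(NoVel_v2(gym.make("LunarLander-v2")), filename = None) for _ in range(32)])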
vec_env = VecNormalize(vec_env, norm_reward = True, gamma = .999)

policy_kwargs = dict(
    net_arch = [dict(pi = [64], vf = [64])],
    activation_fn = th.nn.ReLU,
    lstm_hidden_size = 64,
    shared_lstm = True,
    enable_critic_lstm = False,
)

model = RecurrentPPO('MlpLstmPolicy', vec_env, policy_kwargs = policy_kwargs,
    n_steps = 512,  batch_size = 2048, n_epochs = 4,
    ent_coef = 0.01, vf_coef = .5, gamma = 0.999, gae_lambda = .98,
    verbose = 1, tensorboard_log = PATH + '/tensorboard')

## Ortho init LSTM
lstm = model.policy.lstm_actor

wx = lstm.weight_ih_l0
wh = lstm.weight_hh_l0
bx = lstm.bias_ih_l0
bh = lstm.bias_hh_l0

h = wx.shape[0] // 4
for i in range(4):
    nn.init.orthogonal_(wx[i * h : (i + 1) * h])
    nn.init.orthogonal_(wh[i * h : (i + 1) * h])

nn.init.zeros_(bx)
nn.init.constant_(bx[h : h * 2], .5)
nn.init.zeros_(bh)
nn.init.constant_(bh[h : h * 2], .5)

TensorBoard results

The top row corresponds to PPO2 while the bottom row corresponds to RecurrentPPO. It is useful to mention that Stable Baselines reports the normalised reward values, so we should not focus on the difference in magnitude there (only on the sign). Looking at the crossover point, this might indicate that PPO2 learns faster, but this should be taken with a grain of salt.

araffin commented on May 22, 2024

Hello,

I noticed rather high values for approx_kl and clip_fraction.

yes, I observed that too, but RecurrentPPO was able to solve the different envs.
(btw, the mask wrapper is defined here: https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/utils/wrappers.py#L344)

Which happens with PPO, but the RecurrentPPO run it starts off lower, which seems odd to me.

that's weird indeed, probably initialization is not right.

One thing that immediately pops out is that PPO2 asserts that "For recurrent policies the number of environments run in parallel should be a multiple of nminibatches."

yes, this was an annoying limitation that I removed by doing some masking magic. The code should be fine but if there is a bug that's probably there.

The top row corresponds to PPO2 while the bottom row corresponds to RecurrentPPO

Thanks for running that small study =)
To me, it looks like the two are similar (same magnitude and same dynamic for the different values), no?
and the entropy value is now around 1.4 for RecurrentPPO too?

nevermind, didn't look at the right graph

rnederstigt commented on May 22, 2024

I still find it highly suspicious that the entropy of RecurrentPPO seems to follow the same dynamics as approx_kl and clip_fraction, i.e. it bends back down (getting increasingly random) at 300k steps, while the entropy of PPO2 continues to decline after that point.

I observed this to a greater extent in a custom environment, but haven't been able to replicate it with standard environments.

araffin commented on May 22, 2024

yes, there seems to be an issue

araffin commented on May 22, 2024

I've been investigating that a bit, and at initialization, nothing is suspicious for RecurrentPPO, the sampled actions are uniformly distributed as long as you don't take sequences into account...

==== Uniform actions in randomly sample obs ====
-log(1/4)            = 1.3862943611198906
Entropy RecurrentPPO = 1.3862944841384888
Entropy PPO          = 1.3851330280303955
Log prob RecurrentPPO = -1.386294960975647
Log prob PPO         = -1.3884332180023193
==== Sample actions in randomly sample obs ====
Log prob RecurrentPPO = -1.3862946033477783
Log prob PPO         = -1.3862944841384888
==== Sampled actions stats ===
Should be close to 1/num_actions = 1/4 = 0.25
uniform
2    0.260667
1    0.249000
0    0.245667
3    0.244667
Name: uniform, dtype: float64
ppo
3    0.269667
1    0.250667
0    0.241000
2    0.238667
Name: ppo, dtype: float64
RecurrentPPO
1    0.258667
2    0.249333
0    0.248333
3    0.243667
Name: RecurrentPPO, dtype: float64
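
The reference value `-log(1/4)` in the output above is the entropy of a uniform categorical distribution over four actions, and minus that value is the log-probability of any single action under it. A minimal standalone check follows; `categorical_entropy` is a hypothetical helper and the `skewed` probabilities are made up for comparison:

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution (hypothetical helper)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


num_actions = 4
uniform = [1.0 / num_actions] * num_actions
skewed = [0.7, 0.1, 0.1, 0.1]  # made-up non-uniform policy for comparison

print(categorical_entropy(uniform))  # equals log(num_actions)
print(categorical_entropy(skewed))   # any non-uniform policy has lower entropy
```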

import numpy as np
import torch as th
from sb3_contrib import RecurrentPPO
from sb3_contrib.common.recurrent.type_aliases import RNNStates
from stable_baselines3 import PPO

n_random_actions = 3000

recurrent_model = RecurrentPPO(
    "MlpLstmPolicy",
    "LunarLander-v2",
    verbose=0,
    # to have a big enough rollout buffer
    # in order to create dummy lstm states
    n_steps=n_random_actions,
)


ppo_model = PPO(
    "MlpPolicy",
    "LunarLander-v2",
    verbose=0,
    policy_kwargs=dict(ortho_init=False),
)


env = recurrent_model.get_env()
num_actions = env.action_space.n

# Taking uniform actions in uniformly sampled observations

random_obs = np.array([env.observation_space.sample() for _ in range(n_random_actions)])
single_obs = np.array([random_obs[0] for _ in range(n_random_actions)])
random_actions = np.array([env.action_space.sample() for _ in range(n_random_actions)])


observations, _ = recurrent_model.policy.obs_to_tensor(random_obs)
actions = th.tensor(random_actions, device=observations.device).float()

# Recurrent PPO

# we assume a new episode for each observation
# therefore we don't use lstm states
episode_starts = th.ones_like(actions)
n_layers = 1
n_seq = len(observations)
lstm_states_pi = (
    # (n_steps, n_layers, n_envs, dim) -> (n_layers, n_seq, dim)
    recurrent_model.rollout_buffer.hidden_states_pi[:n_seq].reshape(
        n_layers, n_seq, -1
    ),
    # the second element of the tuple is the cell state, not the hidden state
    recurrent_model.rollout_buffer.cell_states_pi[:n_seq].reshape(
        n_layers, n_seq, -1
    ),
)

distribution, new_lstm_states = recurrent_model.policy.get_distribution(
    observations,
    lstm_states=lstm_states_pi,
    episode_starts=episode_starts,
)
log_prob_recurrent = distribution.log_prob(actions)
entropy_recurrent = distribution.entropy()

# PPO
distribution = ppo_model.policy.get_distribution(observations)
log_prob_ppo = distribution.log_prob(actions)
entropy_ppo = distribution.entropy()

print("==== Uniform actions in randomly sample obs ====")
print(f"{'-log(1/4) ':20} = {-np.log(1/4)}")
print(f"{'Entropy RecurrentPPO':20} = {entropy_recurrent.mean().item()}")
print(f"{'Entropy PPO':20} = {entropy_ppo.mean().item()}")
print(f"{'Log prob RecurrentPPO':20} = {log_prob_recurrent.mean().item()}")
print(f"{'Log prob PPO':20} = {log_prob_ppo.mean().item()}")


# Taking actions in uniformly sampled observations

# Recurrent PPO
# we assume a new episode for each observation
# therefore we don't use lstm states
random_actions_recurrent = recurrent_model.predict(observations, deterministic=False)[0]
actions_recurrent = th.tensor(
    random_actions_recurrent, device=observations.device
).float()


distribution, new_lstm_states = recurrent_model.policy.get_distribution(
    observations,
    lstm_states=lstm_states_pi,
    episode_starts=episode_starts,
)
log_prob_recurrent = distribution.log_prob(actions_recurrent)


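# PPO
random_actions_ppo = ppo_model.predict(observations, deterministic=False)[0]
actions_ppo = th.tensor(random_actions_ppo, device=observations.device).float()

# recompute the distribution with the PPO policy,
# otherwise the RecurrentPPO distribution from above would be reused
distribution = ppo_model.policy.get_distribution(observations)
log_prob_ppo = distribution.log_prob(actions_ppo)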


print("==== Sample actions in randomly sample obs ====")
print(f"{'Log prob RecurrentPPO':20} = {log_prob_recurrent.mean().item()}")
print(f"{'Log prob PPO':20} = {log_prob_ppo.mean().item()}")

# Stats about action sampling
import pandas as pd

df = pd.DataFrame(
    {
        "uniform": random_actions,
        "ppo": random_actions_ppo,
        "RecurrentPPO": random_actions_recurrent,
    }
)

print("==== Sampled actions stats ===")
print("Should be close to 1/num_actions = 1/4 = 0.25")

for name in df.keys():
    print(name)
    print(df[name].value_counts() / n_random_actions)

VVIERV00 commented on May 22, 2024

Let me double-check it on my custom env ;)

rnederstigt commented on May 22, 2024

@araffin That's great to hear!

Off the top of my head, a correct masked entropy loss could be of the form,

entropy_loss = -th.sum(entropy * rollout_data.mask) / th.sum(rollout_data.mask)

The same holds for calculating the value loss.
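
A sketch of what that could look like for the value loss, assuming a mean squared error between returns and value predictions. `masked_mean` is a hypothetical helper, NumPy stands in for `th`, and all numbers are made up:

```python
import numpy as np


def masked_mean(x: np.ndarray, mask: np.ndarray) -> float:
    """Mean of x over the entries where mask is 1 (hypothetical helper)."""
    return float(np.sum(x * mask) / np.sum(mask))


# Made-up returns and value predictions; the last two steps are padding.
returns = np.array([1.0, 0.5, -0.5, 0.0, 0.0])
values = np.array([0.8, 0.7, -0.1, 3.0, -2.0])
mask = np.array([1.0, 1.0, 1.0, 0.0, 0.0])

# Squared errors on padded steps are excluded from both the sum and the count.
value_loss = masked_mean((returns - values) ** 2, mask)
print(value_loss)
```

The same helper could be reused for other per-step quantities that are averaged over a batch.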

araffin commented on May 22, 2024

Off the top of my head, a correct masked entropy loss could be of the form,

@rnederstigt looks good to me (at first, I thought a + eps was missing, but that would mean we masked everything). I would appreciate a PR that updates the loss ;) (and probably other quantities that are wrongly computed because of masking).

I'm also glad that the fix is simple =)

rnederstigt commented on May 22, 2024

@araffin I have a bit of trouble following the padding logic, but I assume the sequences are just padded with zeros?

araffin commented on May 22, 2024

I have a bit of trouble following the padding logic, but I assume the sequences are just padded with zeros?

yes, that's the high-level idea (and the result of the hacky code). The padding code is a bit tricky because we have environments in parallel and different reshapes that must keep the sequence order (I tried to add as many comments as possible to ease the understanding of that part).
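
As a rough illustration of the high-level idea only (not the actual sb3-contrib code, which also has to handle parallel environments and the reshapes mentioned above), zero-padding variable-length sequences together with a mask could be sketched like this. `pad_sequences` is a hypothetical helper and the sequences are made up:

```python
import numpy as np


def pad_sequences(sequences, pad_value=0.0):
    """Pad 1-D sequences to a common length and return (padded, mask)."""
    max_len = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_len), pad_value, dtype=np.float64)
    mask = np.zeros((len(sequences), max_len), dtype=np.float64)
    for row, seq in enumerate(sequences):
        padded[row, : len(seq)] = seq  # real steps keep their values
        mask[row, : len(seq)] = 1.0    # 1 marks a real step, 0 marks padding
    return padded, mask


# Three episode fragments of different lengths (made-up values).
padded, mask = pad_sequences([[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]])
print(padded)
print(mask)
```

The mask returned here plays the role of `rollout_data.mask`: any per-step quantity has to be averaged over the entries where it is 1.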
