
stable-baselines-team / stable-baselines3-contrib


Contrib package for Stable-Baselines3 - Experimental reinforcement learning (RL) code

Home Page: https://sb3-contrib.readthedocs.io

License: MIT License

Languages: Python 99.69%, Makefile 0.27%, Shell 0.03%
Topics: experimental, gsde, gym, machine-learning, openai, pytorch, reinforcement-learning, reinforcement-learning-algorithms, research, rl, robotics, sde, stable-baselines

stable-baselines3-contrib's People

Contributors

adamgleave, alexpasqua, araffin, armandpl, ayeright, burakdmb, cppmaster, cyprienc, ernestum, glmcdona, gregwar, honglu2875, icheered, jonasreiher, kronion, miffyli, minhlong94, mlodel, patrickhelm, qgallouedec, rnederstigt, rogerioagjr, sgillen, toshikwa, vwxyzjn, zikangxiong


stable-baselines3-contrib's Issues

[Feature Request] Invalid action masking

🚀 Feature

Add support for masking out invalid actions when making predictions. This would allow models to converge much faster for environments where many actions may be illegal depending on the state.

This feature could be implemented as a wrapper around a gym environment that adds a method to return the mask. The stable baselines algorithms would check for the wrapper and use the mask if available. The mask is a boolean tensor in the shape of the action space, and it replaces the logits for invalid actions with very large negative values in the underlying probability distribution.
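To make the proposal concrete, here is a rough sketch of both pieces; the wrapper name and the compute_action_masks method are illustrative placeholders, not an existing API:

import gym
import numpy as np
import torch as th

class ActionMaskWrapper(gym.Wrapper):
    """Illustrative wrapper exposing the current action mask."""

    def action_masks(self) -> np.ndarray:
        # Boolean array shaped like the (discrete) action space:
        # True means the action is valid in the current state.
        return self.env.compute_action_masks()  # hypothetical env method

def mask_logits(logits: th.Tensor, masks: np.ndarray) -> th.Tensor:
    # Replace the logits of invalid actions with a very large negative value
    # so that their probability becomes (numerically) zero after the softmax.
    mask_tensor = th.as_tensor(masks, dtype=th.bool, device=logits.device)
    return th.where(mask_tensor, logits, th.tensor(-1e8, device=logits.device))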

Here is an MVP.

Motivation

In many environments, there may be portions of the action space that are invalid to select in a given state. Without a way to avoid sampling these actions, training becomes less efficient. Models have to waste time exploring the invalid portions of the space, which may become prohibitively expensive for large action spaces. See this paper for more details.

Alternatives

To the best of my knowledge, the only alternative is to accept that invalid actions may be selected, and to try to discourage it by penalizing the choice with a large negative reward. This is just the status quo. Note that action masking would be optional, and the status quo would stay the default.

Additional context

I ran into this problem in practice when building models for board games. To work around it, I implemented an MVP of this feature in a fork. I'd be happy to make a PR. The branch is here.

Also related: hill-a/stable-baselines#453

Original issue here: DLR-RM/stable-baselines3#269

Checklist

  • I have checked that there is no similar issue in the repo (required)

SubProcVecEnv with MaskablePPO

First of all, thank you for creating this repo. I'd been trying to implement masking for a couple of weeks until I found you already had it going!

Anyway, I was wondering whether MaskablePPO was coded to work with vectorised environments? I have tried using SubprocVecEnv on CartPole. Minimal code:

import gym

from stable_baselines3.common.utils import set_random_seed
from stable_baselines3.common.vec_env import SubprocVecEnv
from sb3_contrib import MaskablePPO
from sb3_contrib.common.maskable.policies import MaskableActorCriticPolicy
from sb3_contrib.common.wrappers import ActionMasker

def make_env(env_id, rank, seed=0):
    def _init():
        env = gym.make(env_id)
        env.seed(seed + rank)
        return env
    set_random_seed(seed)
    return _init

env_id = 'CartPole-v1'
nproc = 2
env = SubprocVecEnv([make_env(env_id, i) for i in range(nproc)], start_method='fork')

def mask_fn(env):
    return [[0, 1]]  # simple mask

env = ActionMasker(env, mask_fn)
model = MaskablePPO(MaskableActorCriticPolicy, env).learn(100_000)

Error that I get:

   File "/Users/ak2135_admin/Dropbox/gymEOEnv/learn_maskable.py", line 82, in main
     model.learn(time_steps)
 
   File "/Users/ak2135_admin/opt/anaconda3/envs/Hybrid_Framework/lib/python3.7/site-packages/sb3_contrib/ppo_mask/ppo_mask.py", line 528, in learn
     use_masking,
 
   File "/Users/ak2135_admin/opt/anaconda3/envs/Hybrid_Framework/lib/python3.7/site-packages/sb3_contrib/ppo_mask/ppo_mask.py", line 251, in _setup_learn
     self._last_obs = self.env.reset()
 
   File "/Users/ak2135_admin/opt/anaconda3/envs/Hybrid_Framework/lib/python3.7/site-packages/stable_baselines3/common/vec_env/dummy_vec_env.py", line 62, in reset
     self._save_obs(env_idx, obs)
 
   File "/Users/ak2135_admin/opt/anaconda3/envs/Hybrid_Framework/lib/python3.7/site-packages/stable_baselines3/common/vec_env/dummy_vec_env.py", line 92, in _save_obs
     self.buf_obs[key][env_idx] = obs
 
 ValueError: could not broadcast input array from shape (2,4) into shape (4)

I figure it would be straightforward to vectorise the environment as it is in SB3. Not sure if this is on my end or if parallel processing is not yet implemented for MaskablePPO. I would love to help (given some pointers) if it's something that can be added.

All packages should be up to date.
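For reference, here is a sketch (untested, building on the imports above and on the per-environment ActionMasker usage described in other issues here) that wraps each sub-environment before vectorisation rather than wrapping the VecEnv itself:

def make_masked_env(env_id, rank, seed=0):
    def _init():
        env = gym.make(env_id)
        env.seed(seed + rank)
        # Wrap the individual environment, not the VecEnv
        return ActionMasker(env, mask_fn)
    set_random_seed(seed)
    return _init

env = SubprocVecEnv([make_masked_env(env_id, i) for i in range(nproc)], start_method='fork')
model = MaskablePPO(MaskableActorCriticPolicy, env).learn(100_000)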

[Feature Request] Implement MBPO algorithm

Important Note: We do not do technical support, nor consulting and don't answer personal questions per email.
Please post your question on the RL Discord, Reddit or Stack Overflow in that case.

🚀 Feature

I would like to implement a model-based RL algorithm, MBPO proposed here.

Motivation

The proposed algorithm claims to be simpler and up to 10x as sample efficient as some other baselines like SAC.
This would be helpful in my own work too.

Checklist

  • [x] I have checked that there is no similar issue in the repo (required)

A bug with MultiDiscrete action spaces when the actions are not the same size

When the environment has an action space in which each dimension has a different size, like this:
self.action_space = MultiDiscrete([3,2])

and the action masker is like this for example:
a = [[True, False, True],[False, True]]

the following error happens, since the rows of "a" are not the same size:

`File ~\AppData\Roaming\Python\Python39\site-packages\sb3_contrib\common\maskable\distributions.py:228, in MaskableMultiCategoricalDistribution.apply_masking(self, masks)
    226 split_masks = [None] * len(self.distributions)
    227 if masks is not None:
--> 228     masks = th.as_tensor(masks)
    230     # Restructure shape to align with logits
    231     masks = masks.view(-1, sum(self.action_dims))

TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.`
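A possible workaround sketch, assuming the distribution expects one flat boolean array of length sum(action_dims) (here 3 + 2 = 5) rather than a ragged nested list:

import numpy as np

# Ragged per-dimension masks for MultiDiscrete([3, 2])
per_dim_masks = [[True, False, True], [False, True]]

# Concatenate into a single flat array of length sum(action_dims) = 5,
# which the distribution can then reshape via masks.view(-1, sum(action_dims))
flat_mask = np.concatenate([np.asarray(m, dtype=bool) for m in per_dim_masks])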

ARS Returns 0 actions on evaluation env

Hi! During training of the ARS model we obtain valid actions but during evaluation we obtain actions of 0. Both the evaluation and train environment are the same except for using different but similar data. Could this be a bug? model.predict simply returns 0 actions.

Questions regarding BPTT (backpropagation through time)

Hi,

This is more a question. I am implementing some specific experiments using Recurrent PPO, but at some point I would like to set the number of BPTT steps, I mean in a truncated BPTT fashion (let's say I want a recurrence of 32 steps, for example). My questions are:

  • In the current implementation, how many BPTT steps are performed?
  • Is it possible to change this as a hyper-parameter?

I had a look in the code but haven't managed to figure out where this is performed.

Many thanks in advance!

[question] Maskable PPO: creating an actions mask for multibinary action space ISSUE

I'm trying to use MaskablePPO but I have a problem with the dimensions of the mask (what should be returned by env.valid_action_mask())!
My custom environment's action space is

self.action_space = spaces.MultiBinary(4) # 4 variables each has only two options
self.observation_space = spaces.Box(low=-500, high=255,
                                        shape=(16,), dtype=np.float64)

My env valid_action_mask is (basically, I take only the first four features of the observation, and if it's 1, then the corresponding action is invalid)

    def valid_action_mask(self):
        obs = self.observation[0:4]
        mask = np.array([s != 1 for s in obs])  # e.g. [True, False, True, True]
        return mask

This is the training code

def mask_fn(env: gym.Env) -> np.ndarray:
    # Do whatever you'd like in this function to return the action mask
    # for the current env. In this example, we assume the env has a
    # helpful method we can rely on.
    return env.valid_action_mask()

env = pltnMrgEnv()
env = ActionMasker(env, mask_fn)  # Wrap to enable masking
env.reset()

model = MaskablePPO(MaskableActorCriticPolicy, env, verbose=1)
model.learn(total_timesteps=100000)

And I get this error that doesn't make sense!

Traceback (most recent call last):
  File "c:\Users\ali\Merge_platoon_sim_vscode\trainWmaskableppo.py", line 49, in <module>
    model.learn(total_timesteps=100000)
  File "C:\Users\ali\anaconda3\lib\site-packages\sb3_contrib\ppo_mask\ppo_mask.py", line 559, in learn
    continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, self.n_steps, use_masking)
  File "C:\Users\ali\anaconda3\lib\site-packages\sb3_contrib\ppo_mask\ppo_mask.py", line 330, in collect_rollouts
    actions, values, log_probs = self.policy(obs_tensor, action_masks=action_masks)
  File "C:\Users\ali\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\ali\anaconda3\lib\site-packages\sb3_contrib\common\maskable\policies.py", line 115, in forward
    distribution.apply_masking(action_masks)
  File "C:\Users\ali\anaconda3\lib\site-packages\sb3_contrib\common\maskable\distributions.py", line 231, in apply_masking
    masks = masks.view(-1, sum(self.action_dims))
RuntimeError: shape '[-1, 8]' is invalid for input of size 4

I tracked the error, and the problem is that self.action_dims is [2, 2, 2, 2], which makes the sum equal to 8, while the mask is only four elements long!
Should I make the mask length 8? If so, what should the shape be?
I'm assuming that because it's binary, setting the logits to a large negative number makes the output always zero, which means the mask needs to match the total number of action options!
I'm not sure what I'm missing! Please help!
Thanks
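For what it's worth, here is a sketch of an 8-element mask under the assumption that each MultiBinary dimension contributes two options (choose 0 / choose 1), so the mask needs one entry per option:

import numpy as np

def valid_action_mask(self):
    obs = self.observation[0:4]
    mask = []
    for s in obs:
        # option "0" stays allowed; option "1" is invalid when s == 1
        mask.extend([True, s != 1])
    return np.array(mask, dtype=bool)  # shape (8,) == sum(self.action_dims)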

TRPO Agent not working for multi discrete action space

I am trying to use a TRPO agent for a custom environment with a MultiDiscrete action space, but I am getting a "NotImplementedError". The code and system are fine, because the same code works with a Discrete action space.

from gym import spaces, Env
from sb3_contrib import TRPO
import numpy as np

class TestEnv(Env):

    def __init__(self):

        self.action_space = spaces.MultiDiscrete([2,2])
        # self.action_space = spaces.Discrete(4)
        self.observation_space = spaces.Box(low=np.array([0,0]), high=np.array([100,100]))

    def reset(self):
        return np.array([10,10]).astype(np.float32)

    def step(self, action):
        print(action)

        return np.array([10+action[0],10+action[1]]).astype(np.float32), -1, False, {}
        # return np.array([10+action//2,10+action%2]).astype(np.float32), -1, False, {}

from stable_baselines3.common.env_checker import check_env

train_env = TestEnv()
check_env(train_env)

model = TRPO('MlpPolicy', train_env,n_steps=5, batch_size=5)
model.learn(2)

Output Error

Traceback (most recent call last):
  File "/Users/akashgarg/Desktop/PycharmProjects/rlbaselinee_practice/Burger-Dog-AI/enjoy.py", line 27, in <module>
    model = TRPO('MlpPolicy', train_env,n_steps=5, batch_size=5)
  File "/Users/akashgarg/Desktop/PycharmProjects/rlbaselinee_practice/lib/python3.7/site-packages/sb3_contrib/trpo/trpo.py", line 426, in learn
    reset_num_timesteps=reset_num_timesteps,
  File "/Users/akashgarg/Desktop/PycharmProjects/rlbaselinee_practice/lib/python3.7/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 267, in learn
    self.train()
  File "/Users/akashgarg/Desktop/PycharmProjects/rlbaselinee_practice/lib/python3.7/site-packages/sb3_contrib/trpo/trpo.py", line 282, in train
    kl_div = kl_divergence(distribution.distribution, old_distribution.distribution).mean()
  File "/Users/akashgarg/Desktop/PycharmProjects/rlbaselinee_practice/lib/python3.7/site-packages/torch/distributions/kl.py", line 169, in kl_divergence
    raise NotImplementedError
NotImplementedError

System Info
OS: Darwin-21.5.0-x86_64-i386-64bit Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:22 PDT 2022; root:xnu-8020.121.3~4/RELEASE_X86_64
Python: 3.7.0
Stable-Baselines3: 1.5.1a7
PyTorch: 1.11.0
GPU Enabled: False
Numpy: 1.21.6
Gym: 0.21.0

Can't import RecurrentPPO

from sb3_contrib import RecurrentPPO

Describe the bug
Can't import RecurrentPPO

Traceback (most recent call last): File ...

ImportError: cannot import name 'RecurrentPPO' from 'sb3_contrib' (/Users/adiya/opt/anaconda3/lib/python3.8/site-packages/sb3_contrib/__init__.py)

System Info
Describe the characteristic of your environment:

  • Stable-Baselines3 and sb3-contrib versions 1.5.0
  • Python version 3.8

Custom network with image augmentation layer

Hi all,

First off thanks for the hard work you have put in creating stable baselines 3, it's helped me a bunch.

I have a fairly simple suggestion that I think fits in the contrib repo. In my work I've been using a custom network with image augmentation (almost exclusively random translations) applied; this seems to help boost performance and stabilize training.

There have been a few fairly recent papers that have applied this effectively: CURL, RAD, DrQ.

I've been using Kornia as a simple drop-in to apply augmentations before feeding into the feature extractor layers, so the network looks something like this:

from typing import List

import gym
import kornia.augmentation as K
import torch as th
import torch.nn as nn
from stable_baselines3.common.preprocessing import is_image_space
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class ImageAugNatureCNN(BaseFeaturesExtractor):

    def __init__(self, observation_space: gym.spaces.Box,
                 features_dim: int = 512,
                 apply_augmentation: bool = True,
                 shift_range: List[float] = [0.0, 0.0],
                 zoom_range: List[float] = [1.0, 1.0]):

        super(ImageAugNatureCNN, self).__init__(observation_space, features_dim)
        # We assume CxHxW images (channels first)
        # Re-ordering will be done by pre-preprocessing or wrapper
        assert is_image_space(observation_space), (
            "You should use NatureCNN "
            f"only with images not with {observation_space} "
            "(you are probably using `CnnPolicy` instead of `MlpPolicy`)"
        )

        self.apply_augmentation = apply_augmentation
        self.augmentation = nn.Sequential(K.RandomAffine(degrees=0,
                                                         translate=shift_range,
                                                         scale=zoom_range))

        n_input_channels = observation_space.shape[0]
        self.cnn = nn.Sequential(
            nn.Conv2d(n_input_channels, 32, kernel_size=8, stride=4, padding=0),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=0),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=0),
            nn.ReLU(),
            nn.Flatten(),
        )

        # Compute shape by doing one forward pass
        with th.no_grad():
            n_flatten = self.cnn(th.as_tensor(observation_space.sample()[None]).float()).shape[1]

        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.ReLU())

    def forward(self, observations: th.Tensor) -> th.Tensor:

        if self.apply_augmentation:
            observations = self.augmentation(observations)

        # visualise_augmentation(observations)

        cnn_out = self.cnn(observations)
        return self.linear(cnn_out)
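To plug it in, something along these lines should work, using SB3's features_extractor_class / features_extractor_kwargs policy keyword arguments (the environment and the values below are just placeholders):

from stable_baselines3 import PPO

policy_kwargs = dict(
    features_extractor_class=ImageAugNatureCNN,
    features_extractor_kwargs=dict(features_dim=512,
                                   apply_augmentation=True,
                                   shift_range=[0.1, 0.1]),
)
model = PPO("CnnPolicy", "CarRacing-v0", policy_kwargs=policy_kwargs, verbose=1)
model.learn(10_000)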

Let me know if you think this would be a good fit or if there's any improvements you could think of.

Edit: My bad, hit submit a bit early...

MPIVecEnv

Hello,

I was trying to find a way to make the ARS implementation I was working on in this PR faster. My first thought was a drop-in replacement for SubprocVecEnv that uses mpi4py instead. I implemented a first pass here. It is quick and dirty, but still a working proof of concept to see if there is any performance to be gained here.

I am seeing modest speedups in rollout collection time. For Pendulum-v0 with 10 environments I am finding it 4-5x faster than Dummy and Subproc. For HumanoidBulletEnv-v0 with 10 environments I am finding it 8x faster than Dummy and 2x faster than Subproc. It might be possible to squeeze more performance out of it, but probably this is 80% of what can be achieved using this approach.

This is just for rollout collection; any actual speedup to algorithms using this vec env is going to be smaller, but for on-policy algorithms probably still significant.

I wanted to ask if this, or something like it, had been considered. IIRC mpi4py was a big headache to support, but perhaps by confining that dependency to contrib/ most of the headache will disappear. We could also look at, for example, torch.distributed, but I think that would cause a similar number of headaches for most likely less speedup.

This is another thing I would be interested in contributing (over the following weeks...). But again, only if there is interest.

Tensorboard logging every metrics except rollout/ep_rew_mean

Logging with Tensorboard during training with PPO.learn logs all the metrics except rollout/ep_rew_mean.
The metric (ep_rew_mean) is printed correctly in the terminal during training.

eval/mean_reward is printed and logged correctly on Tensorboard.

    env = gym.make('gym_ess_grid/ESS-v1')
    check_env(env)

    eval_env = gym.make('gym_ess_grid/ESS-v1')

    eval_callback = EvalCallback(eval_env, best_model_save_path='./logs/',
                          log_path='./logs/', eval_freq=365*24*3*10,
                                 deterministic=True, render=False)
    checkpoint_callback = CheckpointCallback(save_freq=365*24*3*10, save_path=SAVE_DIR.as_posix())

    model = PPO('MlpPolicy', env, verbose=1, tensorboard_log = "./ess_grid_tensorboard/" )
    model.learn(total_timesteps=365*24*3*1000,callback= [checkpoint_callback, eval_callback])

check_env doesn't return any error.

I never had this problem before; with the v0 version of the environment it worked correctly.
The differences are some changes made to the gym env and the fact that the /envs folder now
contains two python files, essgrid.py and essgrid2.py; both gym envs are registered correctly.

Going back to a version of the repo with only essgrid.py, the logging works correctly.
I haven't yet tried to delete essgrid.py and unregister it; I will try it as soon as my training session is finished.

System Info

  • Installed with pip
  • Stable-Baselines3 == 1.5.0
  • Gym version == 0.21.0

(screenshots omitted: terminal print, folders and init, metrics 1, metrics 2)

No clear example in doc for using MaskablePPO in custom env

Hi all,
I think it is a problem that there isn't any specific and clear code example for the "masking" part of MaskablePPO. Could you please provide a clear example of how to use "masking" in MaskablePPO, or at least a very simple example of a custom environment using MaskablePPO? I can help by providing a simple example.
Thanks
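In the meantime, here is a minimal sketch of a custom Discrete-action environment with MaskablePPO, following the mask_fn / ActionMasker pattern shown in other issues; the environment itself is made up purely for illustration:

import gym
import numpy as np
from gym import spaces
from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker


class SimpleMaskedEnv(gym.Env):
    """Toy env: 4 actions, but action i is only valid while obs[i] == 0."""

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(4)
        self.observation_space = spaces.Box(low=0, high=1, shape=(4,), dtype=np.float32)
        self.state = np.zeros(4, dtype=np.float32)
        self.steps = 0

    def reset(self):
        self.steps = 0
        self.state = np.random.randint(0, 2, size=4).astype(np.float32)
        self.state[np.random.randint(4)] = 0  # keep at least one action valid
        return self.state

    def step(self, action):
        reward = 1.0 if self.state[action] == 0 else -1.0
        self.state = np.random.randint(0, 2, size=4).astype(np.float32)
        self.state[np.random.randint(4)] = 0
        self.steps += 1
        done = self.steps >= 20
        return self.state, reward, done, {}

    def action_masks(self):
        # True = action is currently valid
        return self.state == 0


def mask_fn(env):
    return env.action_masks()


env = ActionMasker(SimpleMaskedEnv(), mask_fn)
model = MaskablePPO("MlpPolicy", env, verbose=1)
model.learn(10_000)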

MaskableEvalCallback doesn't work with StopTrainingOnNoModelImprovement

Describe the bug
MaskableEvalCallback doesn't call StopTrainingOnNoModelImprovement. That's because the callback is only triggered on a new best mean reward, instead of after every evaluation. In that case, StopTrainingOnNoModelImprovement always resets its no_improvement_evals counter, so it never stops the training.
Currently, this is the end of the step of MaskableEvalCallback:
MaskableEvalCallback

            if mean_reward > self.best_mean_reward:
                if self.verbose > 0:
                    print("New best mean reward!")
                if self.best_model_save_path is not None:
                    self.model.save(os.path.join(self.best_model_save_path, "best_model"))
                self.best_mean_reward = mean_reward
                # Trigger callback if needed
                if self.callback is not None:
                    return self._on_event()

Instead it should call self.callback_on_new_best on new best mean reward instead and self.callback after every evaluation:

            continue_training = True

            if mean_reward > self.best_mean_reward:
                if self.verbose > 0:
                    print("New best mean reward!")
                if self.best_model_save_path is not None:
                    self.model.save(os.path.join(self.best_model_save_path, "best_model"))
                self.best_mean_reward = mean_reward
                # Trigger callback on new best model, if needed
                if self.callback_on_new_best is not None:
                    continue_training = self.callback_on_new_best.on_step()

            # Trigger callback after every evaluation, if needed
            if self.callback is not None:
                continue_training &= self._on_event()

            return continue_training 

Just like in EvalCallback

Code example

from sb3_contrib import MaskablePPO
from sb3_contrib.common.maskable.callbacks import MaskableEvalCallback
from stable_baselines3.common.callbacks import StopTrainingOnNoModelImprovement
from minigames.collect_minerals_and_gas.src.env_dicrete import CollectMineralAndGasDiscreteEnv


env = CollectMineralAndGasDiscreteEnv(step_mul=8, realtime=False)
stop_callback = StopTrainingOnNoModelImprovement(max_no_improvement_evals=10, min_evals=10, verbose=1)
eval_callback = MaskableEvalCallback(
    env, eval_freq=100000, deterministic=False, render=False, callback_after_eval=stop_callback, n_eval_episodes=20
)

model = MaskablePPO("MlpPolicy", env)
model.learn(10000000, callback=eval_callback)

System Info
Describe the characteristic of your environment:

  • Describe how the library was installed (pip, docker, source, ...) PIP
  • Stable-Baselines3 and sb3-contrib versions 1.5.0, contrib: 1.5.0
  • GPU models and configuration RTX 3070
  • Python version 3.9.7
  • PyTorch version 1.11.0+cu113
  • Gym version 0.21.0

Change from gamma=0.4 to default in Example Docu

Dear sb3-contrib creators, thanks for this awesome repo.

I have just one small suggestion for the Documentation which might have a big impact for the user:
https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/blob/master/docs/modules/ppo_mask.rst

In this example for MaskablePPO, gamma is set to 0.4.
Pasting this snippet into your own code and overlooking this drastic hyperparameter setting can easily lead to non-learning models.
I suggest setting gamma to the default 0.99 or removing the argument from this example. Best regards

PPO-Recurrent Bug: Policy Arguments 'shared_lstm' and 'enable_critic_lstm' passed wrongly if using recurrent Cnn/MultiInput policies

Describe the bug
The policy classes RecurrentActorCriticCnnPolicy and RecurrentMultiInputActorCriticPolicy wrongly initialize their parent class RecurrentActorCriticPolicy with regard to the arguments 'shared_lstm' and 'enable_critic_lstm'. The argument 'shared_lstm' is missing from the keyword arguments of the constructor of both the CNN and Multi-Input policies and is also not passed to the parent class constructor as a positional argument. As a result, the argument 'enable_critic_lstm' (as passed via policy_kwargs from the model class) lands in the position of the argument 'shared_lstm' in the constructor of RecurrentActorCriticPolicy. Thus, when using one of the two subclass policies, the 'enable_critic_lstm' argument actually controls the 'shared_lstm' setting in the policy initialization, while 'shared_lstm' is either left unset or, when passed explicitly, an error occurs.

Code example

from sb3_contrib import RecurrentPPO

model = RecurrentPPO(
    "CnnLstmPolicy",
    "CarRacing-v0",
    verbose=1,
    policy_kwargs=dict(shared_lstm=True, enable_critic_lstm=False),
)
model.learn(5000)

This code gives the error:

File ".../stable-baselines3-contrib/sb3_contrib/ppo_recurrent/ppo_recurrent.py", line 147, in _setup_model
    self.policy = self.policy_class(
TypeError: __init__() got an unexpected keyword argument 'shared_lstm'

This bug can be fixed by adding the 'shared_lstm' parameter in four positions, two for each of the subclass policies.
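Illustrative sketch of the kind of change described above (not the actual constructor signatures), making both flags explicit keyword arguments that are forwarded by name so they cannot be swapped positionally:

from sb3_contrib.common.recurrent.policies import RecurrentActorCriticPolicy


class PatchedRecurrentCnnPolicy(RecurrentActorCriticPolicy):
    def __init__(self, *args, shared_lstm: bool = False, enable_critic_lstm: bool = True, **kwargs):
        # Accept shared_lstm explicitly and pass both flags by keyword
        super().__init__(*args, shared_lstm=shared_lstm, enable_critic_lstm=enable_critic_lstm, **kwargs)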

question regarding qrdqn

Hi,
While trying to extend QR-DQN to support double QR-DQN, I couldn't convince myself that the current implementation of the update function is correct.

Specifically, refer to the train function in the QRDQN class (qrdqn.py), line 165:

                # Follow greedy policy: use the one with the highest value
                next_quantiles, _ = next_quantiles.max(dim=2)

It looks like the max operation is taken over the action dimension, so for each quantile separately it picks the value from the action that yielded the maximum (e.g. the first quantile can come from a=1, the second quantile from a=4, etc.).
As a result, the target will mix quantile values from different actions.

As I have understood (and checked various other implementations of qrdqn), the intent was to calculate the q value of each of the actions by averaging over the quantiles of each action and only then take the quantiles of the best next action.
something like this:

            with th.no_grad():
                # Compute the quantiles of next observation
                next_quantiles = self.quantile_net_target(replay_data.next_observations)
                best_next_actions = next_quantiles.mean(dim=1).argmax(dim=1,keepdim=True)
                actions_index = best_next_actions[...,None].expand(batch_size,self.n_quantiles,1)
                next_quantiles = next_quantiles.gather(dim=2, index=actions_index).squeeze(dim=2)
                # 1-step TD target
                target_quantiles = replay_data.rewards + (1 - replay_data.dones) * self.gamma * next_quantiles

My question: does the current implementation differ by design? If so, and assuming the implementation I'm proposing here corresponds to the publication, what's the justification for the variant currently implemented?

BTW, great work with SB3! A very useful and easy-to-work-with library!
I mainly use it (and thus extend it) for offline RL settings.

Thanks a lot :)

Maskable RecurrentPPO

Hi!

I was wondering whether a Maskable Recurrent PPO would be laborious to implement?
Our model chooses illegal actions during training and evaluation, but we would really like to keep using the recurrent policy.

Kind regards,
Dylan Prins

ARS calling reset twice

Describe how the library was installed: pip

sb3-contrib==1.5.1a9
Python: 3.9.12
Stable-Baselines3: 1.5.1a8
PyTorch: 1.11.0
GPU Enabled: False
Numpy: 1.22.4
Gym: 0.21.0

I'm using a custom environment that tracks the episode number, with 1 added to env.episode upon reset. With literally every other model this works fine, but with ARS, reset seems to be called twice in a row, so instead of episode 10 showing as episode 10, it shows as episode 20. It is more of an inconvenience than anything else, and my workaround has been to just update the episode number when done == True instead, but I wasn't sure if reset being called twice is impacting the observations, etc. that the model is receiving.

[feature request] IQN and FQF


[Feature request] Can we expect an implementation of IQN and FQF soon, as we already have a very nice implementation of QR-DQN?

MaskableTrialEvalCallback - where to contribute

Hi,
I've created MaskableTrialEvalCallback, which combines TrialEvalCallback and MaskableEvalCallback. Those two live in different repositories, so I don't know to which one I should contribute it.

Does it make sense to move callbacks.py and the wrappers from rl-baselines3-zoo to the contrib repo? I think there is a consensus that hyper-parameter search is essential in RL, so it would make sense to contribute MaskableTrialEvalCallback to this repo.

Allow PPO to turn off advantage normalization

🚀 Feature

Allow PPO to turn off advantage normalization. Follow-up from DLR-RM/stable-baselines3#763

Motivation

In SB3 we can turn off the advantage normalization in A2C via normalize_advantage=False but not in PPO.

Pitch

Add a normalize_advantage flag to PPO's parameters.
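A small sketch of the proposed behaviour, mirroring A2C's existing normalize_advantage flag (names are illustrative, not an actual diff):

import torch as th

def maybe_normalize_advantage(advantages: th.Tensor, normalize_advantage: bool = True) -> th.Tensor:
    # With the flag on, standardize advantages as PPO currently always does;
    # with it off, return them unchanged (the behaviour being requested).
    if normalize_advantage:
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages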

Alternatives

Remove A2C entirely and reproduce A2C by using PPO.

[Feature Request] MaskableRecurrentPPO

Motivation
MaskablePPO is great for large discrete action spaces that have many invalid actions at each step, while RecurrentPPO is useful for giving the agent a memory of previous observations and actions taken, which improves its decision making. Right now, we have to choose between those two algorithms and cannot have the features of both, which would greatly improve training when both action masking and sequence processing are helpful.

Feature
MaskableRecurrentPPO - an algorithm that combines MaskablePPO and RecurrentPPO, or integration of action masking into PPO and RecurrentPPO.

a question regarding RecurrentPPO

Hi, I'm using RecurrentPPO to train a RecurrentActorCriticPolicy.

I noticed that when collecting rollout data, the hidden states of the LSTM at each time step are also stored in the rollout buffer:

rollout_buffer.add(
    self._last_obs,
    actions,
    rewards,
    self._last_episode_starts,
    values,
    log_probs,
    lstm_states=self._last_lstm_states,
)

Then, during training, the values and the log_prob of the actions under the current policy are computed directly by passing rollout_data.lstm_states as inputs:

values, log_prob, entropy = self.policy.evaluate_actions(
    rollout_data.observations,
    actions,
    rollout_data.lstm_states,
    rollout_data.episode_starts,
)

Now that the lstm_states come directly from the buffer, rather than being computed from the start, doesn't that mean that the backpropagation through time procedure merely goes back one step? More precisely, the requires_grad property of rollout_data.lstm_states equals False; is that reasonable?

I'm not sure if there's anything I'm missing. Hope for some replies.

Thanks.

Installing an earlier version of stable-baselines3-contrib (version 1.0) results in endless backtracking of some dependencies

Installing an earlier version of stable-baselines3-contrib (version 1.0) results in endless backtracking of some dependencies. SB3 1.0 was previously installed before running this operation.

Tried it on Google Colab with GPU (but previously got this error on a Linux server as well)

!pip install sb3-contrib==1.0

INFO: pip is looking at multiple versions of coverage to determine which version is compatible with other requirements. This could take a while.
Collecting coverage>=5.2.1
  Using cached coverage-5.4-cp37-cp37m-manylinux2010_x86_64.whl (242 kB)
  Using cached coverage-5.3.1-cp37-cp37m-manylinux2010_x86_64.whl (242 kB)
  Using cached coverage-5.3-cp37-cp37m-manylinux1_x86_64.whl (229 kB)
  Using cached coverage-5.2.1-cp37-cp37m-manylinux1_x86_64.whl (229 kB)
INFO: pip is looking at multiple versions of pytest-cov to determine which version is compatible with other requirements. This could take a while.
Collecting pytest-cov
  Using cached pytest_cov-2.12.0-py2.py3-none-any.whl (20 kB)
INFO: pip is looking at multiple versions of coverage[toml] to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of coverage to determine which version is compatible with other requirements. This could take a while.
  Using cached pytest_cov-2.11.1-py2.py3-none-any.whl (20 kB)
INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. If you want to abort this run, you can press Ctrl + C to do so. To improve how pip performs, tell us what happened here: https://pip.pypa.io/surveys/backtracking
  Downloading pytest_cov-2.11.0-py2.py3-none-any.whl (20 kB)

[Feature Request] Implement TRPO

Hi,

I've started working on implementing TRPO: https://github.com/cyprienc/stable-baselines3-contrib/blob/master/sb3_contrib/trpo/trpo.py

I am currently facing a bug when computing the step direction and maximal step length using the matrix-vector product with the Fisher information matrix.
The denominator of the beta is sometimes negative.

I suspect the Hessian in the Hessian-vector product used for the Conjugate Gradient algorithm is wrong (see implementation):

def Hpv(v, retain_graph=True):
  jvp = (grad_kl * v).sum()
  return flat_grad(jvp, params, retain_graph=retain_graph)
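For reference, this is the usual trick for computing the product with the Hessian of the (averaged) KL without forming the matrix: $Hv = \nabla_\theta\big((\nabla_\theta \bar{D}_{\mathrm{KL}})^\top v\big)$, which requires grad_kl to have been computed with create_graph=True so that the second differentiation can backpropagate through it. At $\theta = \theta_{\text{old}}$ this Hessian coincides with the Fisher information matrix, so $v^\top H v$ should be non-negative up to numerical error.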

Could I have made the graph used to compute grad_kl wrong?
If someone spots something out of place, please let me know.
Thanks,

Cyprien

PS: Here is a snippet to run the code:

import gym

from sb3_contrib.trpo.policies import MlpPolicy
from sb3_contrib.trpo.trpo import TRPO

env = gym.make("CartPole-v1")

model = TRPO(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=10000)

obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()

env.close()

[Feature Request] Better support for action masking for vectorized environments

Motivation
Stable-baselines3 (SB3) has introduced support for action masking (see here), which is a great feature. However, this API requires the user to provide an ActionMasker wrapper. The issue is that some environments (e.g., gym-microrts, pettingzoo) directly provide a vectorized interface, so there is no chance to apply this ActionMasker.

Feature
Extending MaskablePPO to work with vectorized environments natively. With this extension, SB3 + PPO + Mask can work in gym-microrts and pettingzoo.

[question] Question regarding RecurrentPPO and the magnitude of approx_kl / clip_fraction / entropy_loss

While using RecurrentPPO, I noticed rather high values for approx_kl and clip_fraction. On further investigation it also seems that the entropy_loss indicates that the policy does not start from a (mostly) random policy.

More concretely, I used RecurrentPPO and PPO (as a reference) to solve the LunarLander-v2 environment. Since this has an action space of four discrete actions, we expect the starting (random) policy to have an entropy close to $-\log{1/4} \approx 1.39$. This happens with PPO, but the RecurrentPPO run starts off lower, which seems odd to me.

But more apparent is the difference in magnitude of the approx_kl for the two runs.

Question
Is the magnitude difference of the approx_kl and starting point of the entropy_loss expected behaviour?

Code example
I used the following (pseudo) code to perform the test; while not one-to-one, I tried to get the model architectures to be roughly similar.

vec_env = DummyVecEnv([lambda: Monitor(gym.make("LunarLander-v2"), filename = None) for _ in range(32)])
vec_env = VecNormalize(vec_env, norm_reward = True, gamma = .999)

policy_kwargs = dict( net_arch = [64, 64])
lstm_policy_kwargs = dict( net_arch = [64],
    lstm_hidden_size = 64, n_lstm_layers = 1, shared_lstm = True, enable_critic_lstm = False)

model = (Recurrent)PPO('Mlp(Lstm)Policy', vec_env, policy_kwargs = (lstm_)policy_kwargs,
    n_steps = 512,  batch_size = 128, n_epochs = 4,
    ent_coef = 0.01, vf_coef = .5, gamma = 0.999, gae_lambda = .98,
    normalize_advantage = True, 
    verbose = 1, tensorboard_log = PATH + '/tensorboard')

TensorBoard results
Here the orange run corresponds to PPO, while the blue run corresponds to RecurrentPPO

(TensorBoard screenshots omitted)


Implement Truncated Quantile Critics (TQC)

Issue moved from SB3: DLR-RM/stable-baselines3#83

I'm normally against implementing very recent papers before they prove to be valuable but I would like to make an exception for that one, especially because of the good results. It was recently accepted at ICML 2020.

Paper: Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics
Code: https://github.com/bayesgroup/tqc_pytorch

Background

This paper builds on SAC, TD3 and QR-DQN, making use of quantile regression to predict a distribution for the value function (instead of a mean value).
It truncates the quantiles predicted by different networks (a bit as it is done in TD3).
This is for continuous actions only.
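A minimal sketch of the truncation step described above (shapes and names are illustrative):

import torch as th

def truncate_quantiles(quantiles: th.Tensor, top_quantiles_to_drop: int) -> th.Tensor:
    # quantiles: (batch, n_critics, n_quantiles_per_critic)
    batch = quantiles.shape[0]
    # Pool the quantiles from all critics and sort them in ascending order
    pooled = quantiles.reshape(batch, -1).sort(dim=1).values
    # Drop the largest quantiles to control overestimation
    return pooled[:, : pooled.shape[1] - top_quantiles_to_drop]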

Pros

I already implemented it in SB3 (https://github.com/DLR-RM/stable-baselines3/tree/feat/tqc); it was pretty straightforward as I'm using the SAC code for the backbone (I did not remove the duplicated code yet) and the authors' code for the loss. The difference between SAC and TQC is 30 lines (15 for the loss and 15 for the critic code).
And using SAC hyperparameters from the zoo, I could achieve very good results on PyBullet envs and on BipedalWalkerHardcore (for this env it reaches maximal performance 10x faster than in my previous experiments).
The good news is that SAC hyperparameters are transferable to this new algorithm.

The loss function can be re-used to implement QR-DQN, which is apparently a huge improvement over DQN (with minimal effort).
The authors' code is available both in Tensorflow and Pytorch, and the results are really good.

Cons

It adds a bit of complexity/duplication, but this can be mitigated if it derives from the SAC class.

[Question] TQC's CPU Usage

I have some general questions about why CPU usage might spike when resuming training from a checkpoint:

Upon resuming training of a TQC agent from a saved checkpoint, we see that CPU usage is very high when the resumed training job has a batch size nearly double that of the training job that produced the checkpoint.

More concretely, suppose a training job was run with a batch size of 256 and called TQC.save('checkpoint1') at some point during training. Then we resume training starting from TQC.load('checkpoint1') in two different jobs: (1) with a batch size of 384 and (2) with a batch size of 512. Our experience is that resumed job (2), with the batch size of 512, shows > 10x the CPU utilization of the original job and of resumed job (1) with the batch size of 384:

  • The original job and the resumed jobs have similar GPU usage
  • If the resumed job isn't cpu bound, the training speed doesn't appear to change either

I am curious if there is anything about TQC specifically that would cause this type of behaviour, or if this is more likely to be caused by something at the pytorch level?

Sorry if this question doesn't belong here and thanks for all your work on stable-baselines3 and sb3-contrib!

[BUG] action masking does not work with VecEnv and MultiDiscrete action space

Describe the bug
I am aware of #49 (comment) - but it still does not work. I have investigated the code and this is what I found:

When having more than one environment, each using its own ActionMasker, the masks get collected in batch form, so splitting the masks across the distributions does not work. This feels to me like a VecEnv bug; however, I followed the advice in the documentation and comments on how to set up the action masker on a per-env basis.

split_masks = th.split(masks, tuple(self.action_dims), dim=1)

My action space is, for example, MultiDiscrete([5]*72). I am spinning up 128 environments. (FYI: 5*72 = 360.)
When investigating the MaskableMultiCategoricalDistribution, it actually creates 72 MaskableCategorical distributions, as it should.
BUT: the shape of the mask is not (360,) or (1, 360); instead it is (128, 360). This way the masks get split weirdly, and the above-mentioned line as well as the distributions are not built for it, AFAIK. When tracking invalid actions taken in my environment, there are a ton instead of the expected 0.

System Info
Describe the characteristic of your environment:

  • Describe how the library was installed: pip
  • stable-baselines3==1.4.0, sb3-contrib==1.4.0
  • A100 & AMD EPYC (16 cores)
  • Python version 3.9.2
  • PyTorch version 1.11.0+cu113
  • Gym version gym==0.19.0
  • Numpy: 1.22.2

Am I doing something wrong or are there further ways I can debug this?

Implement PPG

As discussed in sb3#346, I'd like to merge an existing implementation of the PPG algorithm.

I'm unsure about the two SDE-related calls here and here; I just oriented myself on PPO, calling it before the policy is used for its forward pass (sadly haven't gotten to read your nice paper yet ;) ).
Two other things worth pointing out:

  1. I added a use_paper_parameters flag.
  2. It's possible to initialize the policy weights the same way the paper did. I could put this behind a policy_kwarg and overwrite the init_weights method at runtime.

I also changed the line length to be PEP-compliant instead of having your desired line length of 127; I hope make format resets my changes.
Finally, I will put in some comments clearly separating the different phases in the algorithm.

Other than that, there are

  • documentation and
  • tests left to write, as well as
  • reproducing the paper.

I can't reproduce the paper results due to space issues on my laptop, sadly. Will see if I can get to it on the weekend; if it's not possible for me, I hope a helpful someone can finish this for me. I will at least provide a code baseline for this as well.

Cannot run example of Recurrent PPO

System Info
Describe the characteristic of your environment:

Code Example: Recurrent PPO

import numpy as np

from sb3_contrib import RecurrentPPO

model = RecurrentPPO("MlpLstmPolicy", "CartPole-v1", verbose=1)
model.learn(5000)

env = model.get_env()
obs = env.reset()
# cell and hidden state of the LSTM
lstm_states = None
num_envs = 1
# Episode start signals are used to reset the lstm states
episode_starts = np.ones((num_envs,), dtype=bool)
while True:
    action, lstm_states = model.predict(obs, state=lstm_states, episode_start=episode_starts, deterministic=True)
    obs, rewards, dones, info = env.step(action)
    episode_starts = dones
    env.render()

Describe the bug
I tried running the Recurrent PPO example from the documentation and hit a TypeError with unexpected keyword argument 'create_eval_env'. Most likely related to #105 with deprecated parameters. I would appreciate any help. Thanks so much!

Traceback (most recent call last):
  File "C:\Users\witten_goat\Documents\vectorRPE\deepRL\sb3_example.py", line 5, in <module>
    model = RecurrentPPO("MlpLstmPolicy", "CartPole-v1", verbose=1)
  File "C:\Users\witten_goat\Anaconda3\envs\thesis\lib\site-packages\sb3_contrib\ppo_recurrent\ppo_recurrent.py", line 105, in __init__
    super().__init__(
TypeError: OnPolicyAlgorithm.__init__() got an unexpected keyword argument 'create_eval_env'

Contributing ARS implementation.

Moving DLR-RM/stable-baselines3#565 to here per request.

🚀 Feature

I would like to contribute an ARS implementation to SB3.

Motivation

BRS/ARS is a dead simple algorithm that works more often than most people would expect. I wish more people would try it on their custom environments (and maybe realize they don't really need DRL at all, or that their environment is too easy etc).

Pitch

I contribute a version of ARS that is compatible with and up to the standards of SB3.

I have a version I made for myself here, and there are other implementations I can compare performance to. ARS is very simple, though some care will be needed to find the most natural / least intrusive way to incorporate it into the existing code base. If you are down with the general idea I can come up with a proposal on that front as well.

I'd also like to run and contribute ARS agents to the sb3-zoo.

I'm happy to do all this with minimal help, but only if the team actually thinks this is useful and will merge it.

Alternatives

  1. Add this to the sb3-contrib repo instead. I don't think ARS is super cutting edge or fancy, but possibly the team would prefer I put it there anyway.

  2. I don't write this algorithm at all and continue using my own version and grumbling that people aren't benchmarking against this simple baseline.

Additional context

Checklist

  • I have checked that there is no similar issue in the repo (required)

[feature request] new logger contribution

Hi,

Thanks for the library and all the effort you put into it!

Together with my friend, we wanted to contribute a new logger to Stable Baselines, namely a Neptune logger.

It will let Stable Baselines and Neptune users automatically log their runs to Neptune (an experiment tracking tool).

Can I open a draft PR, so that we can continue working on it from there?

Please let me know what you think of this idea.

Verbose parameter is not handled in MaskablePPO evaluation.

Describe the bug
In MaskablePPO, MaskableEvalCallback does not handle the verbose parameter. With this issue and the following PR, I suggest passing the self.verbose variable into this constructor.

This issue is created after @qgallouedec 's suggestion in DLR-RM/stable-baselines3#1011 (comment)

eval_callback = MaskableEvalCallback(
    eval_env,
    best_model_save_path=log_path,
    log_path=log_path,
    eval_freq=eval_freq,
    n_eval_episodes=n_eval_episodes,
    use_masking=use_masking,
)
callback = CallbackList([callback, eval_callback])

System Info

OS: Linux-5.13.0-40-generic-x86_64-with-glibc2.17 #45~20.04.1-Ubuntu SMP Mon Apr 4 09:38:31 UTC 2022
Python: 3.8.13
Stable-Baselines3: 1.6.1a0
PyTorch: 1.11.0
GPU Enabled: True
Numpy: 1.22.3
Gym: 0.25.0

[feature request] DQN Clipped and DQN Reg Algorithms

The paper 'Evolving RL Algorithms' (https://arxiv.org/abs/2101.03958) uses evolution strategies to find new modifications of DQN. The paper reports the two best found algorithms, DQN Clipped and DQN Reg. I modified the main Stable Baselines3 DQN implementation to include those modifications. Is there any interest in me formally implementing DQN Clipped and DQN Reg through SB3 Contrib and doing a pull request? It's kind of an obscure paper and the modifications are kind of slight, so I'm not sure if there's much interest in it.

The paper calls DQN Clipped and DQN Reg new algorithms, but they are basically slight modifications to the DQN loss function (example change for DQN Reg replacing DQN's Huber loss: https://github.com/AurelianTactics/dqnclipped_dqnreg_prelim_implementation/blob/main/utils.py#L382). My proposed implementation:

  • Create a new learning algorithm: basically DQN with two new hyperparameters, one to decide on the loss type (standard, Clipped, Reg) and one for the DQN Reg hyperparameter.
  • Test the results on the four classic control environments reported and plotted in the paper: CartPole, Acrobot, LunarLander, and MountainCar. Run 5 random seeds and use EvalCallback to get the numpy results, then plot those results against the paper's plots to compare. The paper also tested on MiniGrid and did some Atari tests for DQN Reg, but I'd prefer it if the four classic control results are enough.

Aside note: is there an example using EvalCallback for Atari in SB3 (main repo, not contrib)? I was eventually able to get EvalCallback working for Atari, but I had to modify some code in EvalCallback. The SB3 code would take my train env and run it through VecTransposeImage. However, the eval env would be run through EvalCallback and not get the same change, so I would get a warning and then an error. I fixed it by having EvalCallback do the same check to see if a VecTransposeImage was needed and then do the transpose (if needed), but I feel like there's probably a more standard, user-friendly way.

[Feature Request] RCE in Stable Baselines 3

🚀 Feature
Implementation of recursive classification of examples (RCE) in SB3. According to the authors, this method can be implemented on top of any actor-critic method, such as SAC or TD3 (which are already part of SB3).

Motivation
Example-based policy search seems to be a very promising path for RL.

Pitch
Enable the use of algorithms that learn from successful examples, without the need for expert trajectories or rewards.

Alternatives
Maybe having consolidated imitation learning or inverse-RL algorithms would be enough.

Additional context
Paper, code, and post available here

[Bug] An error in MaskPPO training

System Info
Describe the characteristic of your environment:

Describe how the library was installed: pip
sb3-contrib=='1.5.1a9'
Python: 3.8.13
Stable-Baselines3: 1.5.1a9
PyTorch: 1.11.0+cu102
GPU Enabled: False
Numpy: 1.22.3
Gym: 0.21.0

My training code is as below:
model = MaskablePPO("MultiInputPolicy", env, gamma=0.4, seed=32, verbose=0)
model.learn(300000)
My action space is spaces.Discrete(). It seems to be a problem in the torch distribution __init__(): the input logits contain invalid values, and the error happens at an unpredictable training step.

File ~/anaconda3/envs/stable_base/lib/python3.8/site-packages/sb3_contrib/ppo_mask/ppo_mask.py:579, in MaskablePPO.learn(self, total_timesteps, callback, log_interval, eval_env, eval_freq, n_eval_episodes, tb_log_name, eval_log_path, reset_num_timesteps, use_masking)
576 self.logger.record("time/total_timesteps", self.num_timesteps, exclude="tensorboard")
577 self.logger.dump(step=self.num_timesteps)
--> 579 self.train()
581 callback.on_training_end()
583 return self

File ~/anaconda3/envs/stable_base/lib/python3.8/site-packages/sb3_contrib/ppo_mask/ppo_mask.py:439, in MaskablePPO.train(self)
435 if isinstance(self.action_space, spaces.Discrete):
436 # Convert discrete action from float to long
437 actions = rollout_data.actions.long().flatten()
--> 439 values, log_prob, entropy = self.policy.evaluate_actions(
440 rollout_data.observations,
441 actions,
442 action_masks=rollout_data.action_masks,
443 )
445 values = values.flatten()
446 # Normalize advantage

File ~/anaconda3/envs/stable_base/lib/python3.8/site-packages/sb3_contrib/common/maskable/policies.py:280, in MaskableActorCriticPolicy.evaluate_actions(self, obs, actions, action_masks)
278 distribution = self._get_action_dist_from_latent(latent_pi)
279 if action_masks is not None:
--> 280 distribution.apply_masking(action_masks)
281 log_prob = distribution.log_prob(actions)
282 values = self.value_net(latent_vf)

File ~/anaconda3/envs/stable_base/lib/python3.8/site-packages/sb3_contrib/common/maskable/distributions.py:152, in MaskableCategoricalDistribution.apply_masking(self, masks)
150 def apply_masking(self, masks: Optional[np.ndarray]) -> None:
151 assert self.distribution is not None, "Must set distribution parameters"
--> 152 self.distribution.apply_masking(masks)

File ~/anaconda3/envs/stable_base/lib/python3.8/site-packages/sb3_contrib/common/maskable/distributions.py:62, in MaskableCategorical.apply_masking(self, masks)
59 logits = self._original_logits
61 # Reinitialize with updated logits
---> 62 super().__init__(logits=logits)
64 # self.probs may already be cached, so we must force an update
65 self.probs = logits_to_probs(self.logits)

File ~/anaconda3/envs/stable_base/lib/python3.8/site-packages/torch/distributions/categorical.py:64, in Categorical.__init__(self, probs, logits, validate_args)
62 self._num_events = self._param.size()[-1]
63 batch_shape = self._param.size()[:-1] if self._param.ndimension() > 1 else torch.Size()
---> 64 super(Categorical, self).__init__(batch_shape, validate_args=validate_args)

File ~/anaconda3/envs/stable_base/lib/python3.8/site-packages/torch/distributions/distribution.py:55, in Distribution.__init__(self, batch_shape, event_shape, validate_args)
53 valid = constraint.check(value)
54 if not valid.all():
---> 55 raise ValueError(
56 f"Expected parameter {param} "
57 f"({type(value).name} of shape {tuple(value.shape)}) "
58 f"of distribution {repr(self)} "
59 f"to satisfy the constraint {repr(constraint)}, "
60 f"but found invalid values:\n{value}"
61 )
62 super(Distribution, self).__init__()

ValueError: Expected parameter probs (Tensor of shape (64, 400)) of distribution MaskableCategorical(probs: torch.Size([64, 400]), logits: torch.Size([64, 400])) to satisfy the constraint Simplex(), but found invalid values:
