humancompatibleai / seals

Benchmark environments for reward modelling and imitation learning algorithms.

License: MIT License


seals's Introduction


Status: early beta.

seals, the Suite of Environments for Algorithms that Learn Specifications, is a toolkit for evaluating specification learning algorithms, such as reward or imitation learning. The environments are compatible with Gym, but are designed to test algorithms that learn from user data, without requiring a procedurally specified reward function.

There are two types of environments in seals:

  • Diagnostic Tasks which test individual facets of algorithm performance in isolation.
  • Renovated Environments, adaptations of widely-used benchmarks such as MuJoCo continuous control tasks and Atari games to be suitable for specification learning benchmarks. In particular, we remove any side-channel sources of reward information from MuJoCo tasks, and give Atari games constant-length episodes (although most Atari environments have observations that include the score).

seals is under active development and we intend to add more categories of tasks soon.

You may also be interested in our sister project imitation, which provides implementations of a variety of imitation and reward learning algorithms.

Check out our documentation for more information about seals.

Quickstart

To install the latest release from PyPI, run:

pip install seals

All seals environments are available in the Gym registry. Simply import seals and then use the environments as you would with your usual RL or specification learning algorithm:

import gymnasium as gym
import seals

env = gym.make('seals/CartPole-v0')
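
For example, a minimal interaction loop (a sketch assuming the gymnasium five-tuple step API; the random policy is just a placeholder):

obs, info = env.reset(seed=0)
for _ in range(1000):
    action = env.action_space.sample()  # stand-in for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()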

We make releases periodically, but if you wish to use the latest version of the code, you can install directly from Git master:

pip install git+https://github.com/HumanCompatibleAI/seals.git

Contributing

For development, clone the source code and create a virtual environment for this project:

git clone git@github.com:HumanCompatibleAI/seals.git
cd seals
./ci/build_venv.sh
pip install -e .[dev]  # install extra tools useful for development

Code style

We follow a PEP 8 code style with line length 88, and typically follow the Google Code Style Guide, but defer to PEP 8 where they conflict. We use the black autoformatter to avoid arguing over formatting. Docstrings follow the Google docstring convention; the Sphinx Napoleon documentation includes an extensive example.

All PRs must pass linting via the ci/code_checks.sh script. It is convenient to install this as a commit hook:

ln -s ../../ci/code_checks.sh .git/hooks/pre-commit

Tests

We use pytest for unit tests and codecov for code coverage. We also use pytype and mypy for type checking.

Workflow

Trivial changes (e.g. typo fixes) may be made directly by maintainers. Any non-trivial changes must be proposed in a PR and approved by at least one maintainer. PRs must pass the continuous integration tests (CircleCI linting, type checking, unit tests and CodeCov) to be merged.

It is often helpful to open an issue before proposing a PR, to allow for discussion of the design before coding commences.

Citing seals

To cite this project in publications:

@misc{seals,
  author = {Adam Gleave and Pedro Freire and Steven Wang and Sam Toyer},
  title = {{seals}: Suite of Environments for Algorithms that Learn Specifications},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/HumanCompatibleAI/seals}},
}

seals's People

Contributors

adamgleave, dfilan, edoardopona, ernestum, ianyfan, pavelcz, pedrofreire, rocamonde, shwang, stewy33, taufeeque9, yawen-d


seals's Issues

Multiple absorbing states for `AbsorbAfterDoneWrapper`?

I am wondering: when do we need multiple absorbing states? Shouldn't one absorbing state suffice?

In AbsorbAfterDoneWrapper, if we don't specify absorb_obs, then the absorbing state changes every episode.
In the case where there is a fixed terminal state, there is only one absorbing state. But if the terminal state is not fixed (e.g. done=True when |s - g| < epsilon), the absorbing state changes every episode.

If having multiple absorbing states is unintentional, an easy fix would be to "latch" the absorbing state to the absorbing state of the first episode.
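
If that fix were adopted, a minimal sketch of the latching behaviour might look like the following (hypothetical code, written against the gymnasium five-tuple API rather than seals' actual AbsorbAfterDoneWrapper internals):

import gymnasium as gym

class LatchedAbsorbWrapper(gym.Wrapper):
    """Hypothetical sketch: reuse the first episode's final observation
    as the single absorbing state for every later episode."""

    def __init__(self, env):
        super().__init__(env)
        self._absorb_obs = None  # latched at the first episode's termination
        self._absorbing = False

    def reset(self, **kwargs):
        self._absorbing = False
        return self.env.reset(**kwargs)

    def step(self, action):
        if self._absorbing:
            # Remain in the latched absorbing state with zero reward.
            return self._absorb_obs, 0.0, False, False, {}
        obs, reward, terminated, truncated, info = self.env.step(action)
        if terminated or truncated:
            if self._absorb_obs is None:
                self._absorb_obs = obs  # latch once, on the first episode
            self._absorbing = True
            return self._absorb_obs, reward, False, False, info
        return obs, reward, terminated, truncated, info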

Does not work with the latest mujoco 3.1.1

When trying to load the swimmer environment, I get a

ValueError: XML Error: Schema violation: unrecognized attribute: 'collision'

error when mujoco 3.1.1 is installed. Downgrading to mujoco==2.3.7 fixes the issue.

The dependencies in setup.py are not pinned or constrained at all. Maybe adding a constraint like mujoco~=2.3 would fix this.
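
For illustration, the constraint could look roughly like this in setup.py (a sketch; the exact version bound is an assumption):

from setuptools import setup

setup(
    name="seals",
    install_requires=[
        # "~=2.3" allows 2.3.x releases but blocks mujoco 3.x, which rejects
        # the 'collision' XML attribute used by the bundled environment files.
        "mujoco~=2.3",
    ],
)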

Strange non-type-error

In base_envs.py, the class TabularModelPOMDP has a method obs_dtype that returns the data type of observation vectors; specifically, it returns self.observation_matrix.dtype. However, it is annotated as returning an int. When I run code checks without having first activated the virtual environment, mypy reports a type error; but when I activate the virtual environment first, mypy doesn't complain at all. At first glance this would certainly appear to be a type error, but maybe it isn't for some reason?
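
If the annotation is indeed wrong, the fix would presumably be to annotate the return type as np.dtype; a simplified sketch (not the actual base_envs.py code):

import numpy as np

class TabularModelPOMDP:
    """Simplified stand-in for the real class; only the relevant attribute."""

    observation_matrix: np.ndarray

    def obs_dtype(self) -> np.dtype:  # previously annotated as -> int
        """Return the data type of observation vectors, e.g. np.float32."""
        return self.observation_matrix.dtype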

Importing `seals.testing.envs` errors when `mujoco_py` isn't installed

This makes imitation testing break on a pip install -e .[dev,test] install (not on our CircleCI instance, which comes with MuJoCo).

This isn't a high-priority thing for me; I just wanted to document the problem, having run into it.

During handling of the above exception, another exception occurred:
tests/test_envs.py:5: in <module>
    from seals.testing import envs as seals_test
env/lib/python3.7/site-packages/seals/testing/envs.py:21: in <module>
    from gym.envs.mujoco import mujoco_env
env/lib/python3.7/site-packages/gym/envs/mujoco/__init__.py:1: in <module>
    from gym.envs.mujoco.mujoco_env import MujocoEnv
env/lib/python3.7/site-packages/gym/envs/mujoco/mujoco_env.py:14: in <module>
    raise error.DependencyNotInstalled("{}. (HINT: you need to install mujoco_py, and also perform the setup instructions here: https://github.com/openai/mujoco-py/.)".format(e))
E   gym.error.DependencyNotInstalled: No module named 'mujoco_py'. (HINT: you need to install mujoco_py, and also perform the setup instructions here: https://github.com/openai/mujoco-py/.)
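
One possible workaround (a sketch, not necessarily the fix the maintainers chose) is to guard the MuJoCo import so the module still loads without mujoco_py:

import gym

try:
    from gym.envs.mujoco import mujoco_env  # raises if mujoco_py is missing
except gym.error.DependencyNotInstalled:
    mujoco_env = None  # MuJoCo-specific test helpers can check for None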

Replace AutoResetWrapper by a HideTerminationWrapper

gymnasium has an AutoResetWrapper that behaves similarly to ours, but it does not hide the termination conditions. We should use the upstream AutoResetWrapper, provide a wrapper that does nothing but hide the terminated/truncated signals, and use the two in combination.
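
A minimal sketch of what such a wrapper could look like (hypothetical; the upstream auto-reset wrapper's name varies across gymnasium versions):

import gymnasium as gym

class HideTerminationWrapper(gym.Wrapper):
    """Hypothetical sketch: always report terminated=truncated=False so that
    termination conditions cannot leak information about the reward."""

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        return obs, reward, False, False, info

# Assumed usage, composed with the upstream auto-reset behaviour:
# env = HideTerminationWrapper(gym.wrappers.AutoResetWrapper(gym.make("CartPole-v1")))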

seals Cartpole observation space doesn't match vanilla CartPole's observation space

INFO - root - Loading Stable Baselines policy for '<class 'stable_baselines3.ppo.ppo.PPO'>' from 'data/expert_models/seals_cartpole_0/policies/final/'
ERROR - expert_demos - Failed after 0:00:03!
Traceback (most recent calls WITHOUT Sacred internals):
  File "/home/steven/PycharmProjects/imitation-satellite/env/lib/python3.7/site-packages/wrapt/wrappers.py", line 564, in __call__
    args, kwargs)
  File "/home/steven/PycharmProjects/imitation-satellite/src/imitation/scripts/expert_demos.py", line 218, in rollouts_from_policy
    policy = serialize.load_policy(policy_type, policy_path, venv)
  File "/home/steven/PycharmProjects/imitation-satellite/src/imitation/policies/serialize.py", line 198, in load_policy
    return agent_loader(policy_path, venv)
  File "/home/steven/PycharmProjects/imitation-satellite/src/imitation/policies/serialize.py", line 140, in f
    model = cls.load(model_path, env=venv)
  File "/home/steven/PycharmProjects/imitation-satellite/env/lib/python3.7/site-packages/stable_baselines3/common/base_class.py", line 597, in load
    check_for_correct_spaces(env, data["observation_space"], data["action_space"])
  File "/home/steven/PycharmProjects/imitation-satellite/env/lib/python3.7/site-packages/stable_baselines3/common/utils.py", line 206, in check_for_correct_spaces
    raise ValueError(f"Observation spaces do not match: {observation_space} != {env.observation_space}")
ValueError: Observation spaces do not match: Box(4,) != Box(4,)

This happens when I try to generate rollouts in imitation inside the seals/CartPole environment using a vanilla CartPole policy.
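
In old gym, Box's repr only shows the shape, while equality also compares the low/high bounds, so two spaces can print identically yet compare unequal. A quick diagnostic sketch (assuming the old gym API shown in the traceback):

import gym
import seals  # registers seals/CartPole-v0

plain = gym.make("CartPole-v0").observation_space
wrapped = gym.make("seals/CartPole-v0").observation_space

# The repr hides the bounds and dtype, so print them explicitly.
print(plain.low, plain.high, plain.dtype)
print(wrapped.low, wrapped.high, wrapped.dtype)
print(plain == wrapped)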

Seals Atari environments show game score

By default, Atari environments display the game score and sometimes other information (like enemy ship count in Enduro) that could be used to infer the reward. In the original RLHF paper, they mask these regions.
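
A rough sketch of the masking approach (hypothetical code; seals' actual implementation is the MaskScoreWrapper in seals.util, mentioned in a traceback below):

import gymnasium as gym
import numpy as np

class ScoreMaskWrapper(gym.ObservationWrapper):
    """Hypothetical sketch: blank out a fixed pixel region (e.g. the score bar)
    so the reward cannot be read off the observation."""

    def __init__(self, env, region=((0, 10), (0, 160))):  # (rows, cols); coords are assumed
        super().__init__(env)
        self.region = region

    def observation(self, obs):
        (r0, r1), (c0, c1) = self.region
        obs = obs.copy()
        obs[r0:r1, c0:c1] = 0  # fill the masked region with black
        return obs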

Add old tests from imitation envs to seals

Old tests:

def test_model_envs(env):
    """Smoke test for each of the ModelBasedEnv methods with type checks.
    Args:
        env: The environment to test.
    Raises:
        AssertionError if test fails.
    """
    state = env.initial_state()
    assert env.pomdp_state_space.contains(state)

    action = env.action_space.sample()
    new_state = env.transition(state, action)
    assert env.pomdp_state_space.contains(new_state)

    reward = env.reward(state, action, new_state)
    assert isinstance(reward, float)

    done = env.terminal(state, 0)
    assert isinstance(done, bool)

    obs = env.obs_from_state(state)
    assert env.pomdp_observation_space.contains(obs)
    next_obs = env.obs_from_state(new_state)
    assert env.pomdp_observation_space.contains(next_obs)

Variable Horizon in seals/CartPole

from imitation.algorithms.adversarial.airl import AIRL
from imitation.rewards.reward_nets import BasicShapedRewardNet
from imitation.util.networks import RunningNorm
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv

import gym
import seals

# `rollouts` (expert demonstrations) is assumed to be loaded as in the notebook.
learners_rewards_after_training = []
learners_rewards_before_training = []
venv = DummyVecEnv([lambda: gym.make("seals/CartPole-v0")] * 8)
learner = PPO(
    env=venv,
    policy=MlpPolicy,
    batch_size=64,
    ent_coef=0.0,
    learning_rate=0.0003,
    n_epochs=10,
)
reward_net = BasicShapedRewardNet(
    venv.observation_space, venv.action_space, normalize_input_layer=RunningNorm
)
airl_trainer = AIRL(
    demonstrations=rollouts,
    demo_batch_size=1024,
    gen_replay_buffer_capacity=2048,
    n_disc_updates_per_round=4,
    venv=venv,
    gen_algo=learner,
    reward_net=reward_net,
)

for i in range(10):
    learner_rewards_before_training, _ = evaluate_policy(
        learner, venv, 100, return_episode_rewards=True
    )
    learners_rewards_before_training.append(learner_rewards_before_training)

    airl_trainer.train(20000)  # Note: set to 300000 for better results
    learner_rewards_after_training, _ = evaluate_policy(
        learner, venv, 100, return_episode_rewards=True
    )
    learners_rewards_after_training.append(learner_rewards_after_training)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_16872\944136942.py in <module>
     41 
     42 
---> 43     airl_trainer.train(20000)  # Note: set to 300000 for better results
     44     learner_rewards_after_training, _ = evaluate_policy(
     45         learner, venv, 100, return_episode_rewards=True

c:\users\stephane\documents\imitation\src\imitation\algorithms\adversarial\common.py in train(self, total_timesteps, callback)
    416         )
    417         for r in tqdm.tqdm(range(0, n_rounds), desc="round"):
--> 418             self.train_gen(self.gen_train_timesteps)
    419             for _ in range(self.n_disc_updates_per_round):
    420                 with networks.training(self.reward_train):

c:\users\stephane\documents\imitation\src\imitation\algorithms\adversarial\common.py in train_gen(self, total_timesteps, learn_kwargs)
    385 
    386         gen_trajs, ep_lens = self.venv_buffering.pop_trajectories()
--> 387         self._check_fixed_horizon(ep_lens)
    388         gen_samples = rollout.flatten_trajectories_with_rew(gen_trajs)
    389         self._gen_replay_buffer.store(gen_samples)

c:\users\stephane\documents\imitation\src\imitation\algorithms\base.py in _check_fixed_horizon(self, horizons)
     89         if len(horizons) > 1:
     90             raise ValueError(
---> 91                 f"Episodes of different length detected: {horizons}. "
     92                 "Variable horizon environments are discouraged -- "
     93                 "termination conditions leak information about reward. See"

ValueError: Episodes of different length detected: {548, 500}. Variable horizon environments are discouraged -- termination conditions leak information about reward. See https://imitation.readthedocs.io/en/latest/guide/variable_horizon.html for more information. If you are SURE you want to run imitation on a variable horizon task, then please pass in the flag: `allow_variable_horizon=True`.

This happens when trying to run the demo from https://github.com/HumanCompatibleAI/imitation/blob/master/examples/4_train_airl.ipynb with a for loop around the training steps; it produces episodes of different horizons.

TypeError with Python 3.9 in seals 0.2

I receive the following TypeError:

    from seals import atari, util
  File "/home/pavel/anaconda3/envs/py-3-9-test/lib/python3.9/site-packages/seals/atari.py", line 8, in <module>
    from seals.util import (
  File "/home/pavel/anaconda3/envs/py-3-9-test/lib/python3.9/site-packages/seals/util.py", line 140, in <module>
    class MaskScoreWrapper(
  File "/home/pavel/anaconda3/envs/py-3-9-test/lib/python3.9/typing.py", line 1037, in __init_subclass__
    raise TypeError(f"Some type variables ({s_vars}) are"
TypeError: Some type variables (+_ScalarType_co) are not listed in Generic[~ActType]

How to reproduce

  • create a new virtual environment with python 3.9
  • pip install seals==0.2
  • python -c "import seals"

This problem doesn't occur with Python 3.8, and downgrading to seals 0.1.5 also solves it.
So maybe it's related to changes to typing in Python 3.9?

I'm not sure if seals officially supports python 3.9, but imitation presumably does and that's where I first ran into this problem.
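
For context, this class of error occurs whenever a subclass lists Generic[...] explicitly but one of its parameterized bases uses a type variable not included there. A minimal illustration (hypothetical names, not the seals code) that triggers the same message:

from typing import Generic, TypeVar

S_co = TypeVar("S_co", covariant=True)
ActType = TypeVar("ActType")

class Base(Generic[S_co]):
    pass

# S_co appears in a base class but is missing from Generic[ActType], so
# class creation raises:
#   TypeError: Some type variables (+S_co) are not listed in Generic[~ActType]
class Child(Base[S_co], Generic[ActType]):
    pass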

Add overview to documentation

We should:

  • Add a quickstart showing how to install and use the project (a slight expansion of what's in README.md probably suffices).
  • Add a bit about the motivation of the project, background, links to the repo, a list of maintainers, etc.
  • Add an overview of the environments in the project.
  • Explain how to cite the project. We should have some papers on this soon, but may also want to put up an overview paper on arXiv, or perhaps submit to JOSS.

Need to actually implement something non-trivial first, though :)
