marlbenchmark / on-policy

This is the official implementation of Multi-Agent PPO (MAPPO).

Home Page: https://sites.google.com/view/mappo

License: MIT License

Languages: Python 83.05%, CMake 0.06%, Shell 3.75%, C++ 12.26%, C 0.87%
Topics: hanabi, mappo, smac, mpes, starcraftii, ppo, multi-agent, algorithms

on-policy's Introduction

MAPPO

New update: we now support SMACv2!

Chao Yu*, Akash Velu*, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu.

This repository implements MAPPO, a multi-agent variant of PPO. The implementation in this repository is used in the paper "The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games" (https://arxiv.org/abs/2103.01955). This repository is heavily based on https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail. We have also made the off-policy repo public; please feel free to try that as well (off-policy link).

All hyperparameters and training curves are reported in the appendix; we strongly suggest double-checking the important factors (rollout threads, episode length, PPO epochs, mini-batches, clip term, and so on) before running the code. We have also added the latest results on the Google Research Football testbed, along with suggestions on episode length and parameter sharing, to the appendix; please check them out.

We have recently noticed that many papers do not reproduce the MAPPO results correctly, probably due to the rough hyperparameter descriptions. We have therefore added training scripts for each map or scenario in /train/train_xxx_scripts/*.sh; feel free to try them.

Environments supported: SMAC (StarCraft II, including SMACv2), Hanabi, the MPEs, and Google Research Football (GRF).

1. Usage

WARNING: by default, all experiments assume a shared policy, i.e., a single neural network is shared by all agents.

All core code is located within the onpolicy folder. The algorithms/ subfolder contains algorithm-specific code for MAPPO.

  • The envs/ subfolder contains environment wrapper implementations for the MPEs, SMAC, and Hanabi.

  • Code to perform training rollouts and policy updates is contained within the runner/ folder; there is a runner for each environment.

  • Executable scripts for training with default hyperparameters can be found in the scripts/ folder. The files are named in the following manner: train_algo_environment.sh. Within each file, the map name (in the case of SMAC and the MPEs) can be altered.

  • Python training scripts for each environment can be found in the scripts/train/ folder.

  • The config.py file contains the relevant hyperparameter and environment settings. Most hyperparameters default to the values used in the paper; however, please refer to the appendix for the full list of hyperparameters used. A minimal sketch of how these settings can be consumed is shown below.
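To illustrate how these settings are typically consumed, here is a minimal sketch. It assumes config.py exposes a get_config() helper returning an argparse parser (an assumption, not a documented API), and the argument names are taken from the config dump reproduced later in the issues section.

# Hedged sketch: build the parser from config.py and override a few
# hyperparameters the way the train_*.sh scripts do on the command line.
from onpolicy.config import get_config  # get_config() is assumed to exist

parser = get_config()
all_args, _ = parser.parse_known_args([
    "--env_name", "MPE",
    "--algorithm_name", "mappo",
    "--experiment_name", "check",
    "--ppo_epoch", "10",
    "--num_mini_batch", "1",
])
print(all_args.ppo_epoch, all_args.num_mini_batch)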

2. Installation

Here we give an example installation for CUDA 10.1. For non-GPU installs or other CUDA versions, please refer to the PyTorch website. Note that this repository does not depend on a specific CUDA version; feel free to use whichever CUDA version suits your machine.

# create conda environment
conda create -n marl python==3.6.1
conda activate marl
pip install torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html
# install on-policy package
cd on-policy
pip install -e .

Even though we provide requirement.txt, it may contain redundant packages. We recommend installing any additional required packages as you run the code and encounter missing-module errors.

2.1 StarCraftII 4.10

unzip SC2.4.10.zip
# password is iagreetotheeula
echo "export SC2PATH=~/StarCraftII/" >> ~/.bashrc

For SMAC v2, please refer to https://github.com/oxwhirl/smacv2.git. Make sure you have the 32x32_flat.SC2Map map file in your SMAC_Maps folder.
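Before launching training, a few lines of Python can sanity-check the StarCraft II setup. The Maps/SMAC_Maps layout below is the conventional SMAC location and is an assumption here, not something this repository prescribes.

import os

# Quick sanity check that StarCraft II and the SMACv2 map are where SMAC
# typically expects them (Maps/SMAC_Maps is an assumed, conventional layout).
sc2_path = os.environ.get("SC2PATH", os.path.expanduser("~/StarCraftII"))
map_file = os.path.join(sc2_path, "Maps", "SMAC_Maps", "32x32_flat.SC2Map")
print("SC2PATH:", sc2_path)
print("StarCraft II directory found:", os.path.isdir(sc2_path))
print("32x32_flat.SC2Map found:", os.path.isfile(map_file))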

2.2 Hanabi

The Hanabi environment code is derived from the open-source environment code, slightly modified to fit the algorithms used here.
To install, execute the following:

pip install cffi
cd envs/hanabi
mkdir build && cd build
cmake ..
make -j

Here are all hanabi models.

2.3 MPE

# install this package first
pip install seaborn

There are 3 Cooperative scenarios in MPE:

  • simple_spread
  • simple_speaker_listener, which is the 'Comm' scenario in the paper
  • simple_reference

2.4 GRF

Please see the football repository to install the football environment.

3. Train

Here we use train_mpe.sh as an example:

cd onpolicy/scripts
chmod +x ./train_mpe.sh
./train_mpe.sh

Local results are stored in the subfolder scripts/results. Note that we use Weights & Biases as the default visualization platform; to use Weights & Biases, please register and log in to the platform first. More instructions for using Weights & Biases can be found in the official documentation. Adding --use_wandb on the command line or in the .sh file will use TensorBoard instead of Weights & Biases (see the sketch below).
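The flag name is counter-intuitive: passing --use_wandb turns W&B off. A plausible declaration in config.py would use argparse's store_false action, as in this hedged sketch (the exact declaration and help text are assumptions, not copied from the repository):

import argparse

# Hedged sketch of how --use_wandb is presumably declared in config.py:
# the default is True (log to Weights & Biases), and passing the flag flips it
# to False, which matches the behaviour described above (TensorBoard instead).
parser = argparse.ArgumentParser()
parser.add_argument("--use_wandb", action="store_false", default=True,
                    help="pass this flag to disable W&B and log with TensorBoard")

print(parser.parse_args([]).use_wandb)               # True  -> Weights & Biases
print(parser.parse_args(["--use_wandb"]).use_wandb)  # False -> TensorBoard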

We additionally provide ./eval_hanabi_forward.sh for evaluating the Hanabi score over 100k trials.

4. Publication

If you find this repository useful, please cite our paper:

@inproceedings{
yu2022the,
title={The Surprising Effectiveness of {PPO} in Cooperative Multi-Agent Games},
author={Chao Yu and Akash Velu and Eugene Vinitsky and Jiaxuan Gao and Yu Wang and Alexandre Bayen and Yi Wu},
booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2022}
}

on-policy's People

Contributors

akashvelu, colourfulspring, deedive, eugenevinitsky, hosnls, jason-huang03, jensenlzx, jxwuyi, samuelebolotta, sapio-s, xhn-1, zoeyuchao


on-policy's Issues

The PopArt implementation is very different from the paper you cited!

I just read the original paper that proposed PopArt. Their idea is mainly to keep the normalization factors from affecting the learning process by modifying the W and b of the last layer, but your implementation is just a z-transformation for value normalization, which actually hurts training in my experiments. Maybe you should follow the original PopArt paper?

The problem about actor loss

MAPPO works well in my environment: the reward increases, and the critic loss decreases to convergence. However, my actor loss increases to convergence; shouldn't it decrease? Could you please explain this? Thank you.

Using a global state

In the config.py file, there is this env parameter:

    parser.add_argument("--use_obs_instead_of_state", action='store_true',
                        default=False, help="Whether to use global state or concatenated obs")

I would like to use a global state in my code. However, I don't understand how the aforementioned parameter is used, since base_runner is the only script that extracts it (and then doesn't use it anyway). Thanks!

W & B errors about running train_mpe

When running the train_mpe file, W&B prompts appear. I chose the option saying I do not need visual results, but the run still errors out. Does this program have to use W&B? I have only ever used TensorBoard and don't know how to use W&B. If W&B is not mandatory, could you add a TensorBoard option? Thank you!

Does the param 'n_rollout_threads' influence the final performance greatly?

Hello, when I read and ran rmappo with the default parameters in train_mpe.sh, I found that convergence on simple_spread becomes slower when I set n_rollout_threads=20 instead of the original 128. Could the author advise on how n_rollout_threads influences the algorithm's performance? As I understand it, this parameter determines the number of parallel environments and therefore the number of samples per PPO update. I look forward to hearing from the author.

Metric problem about MAPPO

Nice paper and project! We are also doing some research on this topic.

We find something confusing in the original MAPPO paper: why do the shaded regions of the following figure on page 7 exceed a win rate of 1.0? What metric is used in this figure? Could you give some detailed statements? Thank you!

[screenshot: win-rate curves from page 7 of the paper]

AssertionError: check recurrent policy!

I have run your code:

./train_smac.sh

however, the following error occurred:

env is StarCraft2, map is corridor, algo is mappo, exp is mlp, max seed is 1
seed is 1:
Traceback (most recent call last):
  File "train/train_smac.py", line 175, in <module>
    main(sys.argv[1:])
  File "train/train_smac.py", line 82, in main
    "check recurrent policy!")
AssertionError: check recurrent policy!

I changed the argument "algo" in the file "train_smac.sh":

algo="rmappo"

I don't know whether this modification is suitable or not. It did work, but the results were not satisfactory:
[training curve screenshots attached]

Do you have some good parameters for training?

continuous action space

Hi, can MAPPO be used with a continuous action space? How can I do this? When I change discrete_action in environment.py to False, the following error appears (screenshot attached).

Centralized-V between IPPO and MAPPO

Hi @jxwuyi @eugenevinitsky @zoeyuchao @akashvelu

Thanks for your work!

Just a quick question: does turning use_centralized_V on or off only affect whether the input comes from local observations or from the centralized state? Does it affect the structure of the value network itself? From the code, it looks like the input size is always (num_agents, input_dim) and the output values have size (num_agents, 1) regardless of whether use_centralized_V is true or false, so the network is not affected by use_centralized_V, and the centralized value outputs will be the same across num_agents, right?

Look forward to your reply! Thank you!

Best,
Xubo

How to visualize the .wandb and the others

I hope to visualize the average_episode_rewards, but in this folder I can only see a file named "events.out.tfevents.xxx.amax". How can I open this file? Is there something I should change? I would appreciate it if you could tell me.
I also hope you can tell me how to open a .wandb file.

Confusion about the huber_loss

The huber_loss function in utils is:

def huber_loss(e, d):
    a = (abs(e) <= d).float()
    b = (e > d).float()
    return a*e**2/2 + b*d*(abs(e)-d/2)

It yields zero loss when the error is negative and its magnitude is greater than huber_delta.

If I'm not mistaken, it should be
b = (abs(e) > d).float()
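
Putting the proposed fix together, a corrected version would read as follows (a sketch assuming e is a torch tensor, so .float() is available):

import torch

def huber_loss(e, d):
    # Use abs(e) in both masks so large *negative* errors fall into the
    # linear branch instead of producing zero loss.
    a = (torch.abs(e) <= d).float()
    b = (torch.abs(e) > d).float()
    return a * e ** 2 / 2 + b * d * (torch.abs(e) - d / 2)

# Example: an error of -5 with delta 1 now incurs the linear Huber penalty.
print(huber_loss(torch.tensor([-5.0, 0.5]), 1.0))  # tensor([4.5000, 0.1250])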

Looking forward to hearing from you.

A few questions about death masking

Could you confirm if I understood it right according to #1:

  1. You return a death mask which is used only by the policy to compute advantages, the policy loss, and the entropy, plus you change the global state for the agent. And that's all? I tried a similar thing with IPPO and it didn't work well.

  2. Could you point out where the feature pruning vs. the default env global state is handled in the code?

Thanks,
Denys

Death Masking different from the paper

Hello!
I checked your paper and found in Section 4.5 that "death masking" simply uses an agent-specific constant vector, i.e., a zero vector with the agent's ID, as the input to the value function after an agent dies. However, I can't find the corresponding code in this project; you seem to apply an "active mask" to the entropy term, not to the state.
Am I missing something important in this project?

About Hanabi Implementation Details

Why is the reward of step 0 thrown away when collecting samples for the Hanabi environment?

  def collect(self, step):
      for current_agent_id in range(self.num_agents):
          ...
          # rearrange reward
          # reward of step 0 will be thrown away.
          self.turn_rewards[choose, current_agent_id] = self.turn_rewards_since_last_action[choose, current_agent_id].copy()
          self.turn_rewards_since_last_action[choose, current_agent_id] = 0.0
          self.turn_rewards_since_last_action[choose] += rewards[choose]
          ...

Good work, but ...

If you let QMIX use 8 processes, increase the batch size and/or the number of training epochs per update, and finally add TD(lambda) <= 0.5, QMIX can beat all of these algorithms.
See our brief hyperparameter tuning: https://arxiv.org/abs/2102.03479
I have really found that in the MARL field, hyperparameter-tuning problems have led to a pile of wrong conclusions and experiments, and even wrong motivations, involving more than ten CCF-A top-conference papers.
AAAI in particular accepts papers even when the proofs are wrong.

Hyperparameters of IPPO

Are the IPPO hyperparameter choices the same as the MAPPO ones shown in Table 12? Is the only difference the value of "use_centralized_V" (False for IPPO, True for MAPPO)? Thanks!

Trained Hanabi agents

Hi

Is it possible to get access to the 4 MAPPO-trained actor and critic models for Hanabi that you refer to in the paper?

Many thanks!

Can you open-source MASAC code base?

Hello,
Thanks for open-sourcing a really good piece of work. I was wondering if you could also open-source the MASAC code base, as it would help in understanding how MASAC differs from MADDPG and MAPPO. Thanks in advance for the help.

Why use self.buffer[agent_id].after_update()?

def train(self):
    train_infos = []
    for agent_id in range(self.num_agents):
        self.trainer[agent_id].prep_training()
        train_info = self.trainer[agent_id].train(self.buffer[agent_id])
        train_infos.append(train_info)
        self.buffer[agent_id].after_update()

    return train_infos

Why is after_update() called after train()?

The wandb backend process has shutdown

Has anyone had a problem similar to mine? I ran this code on the Colab platform.
/content/drive/MyDrive/MAPPO/on-policy-main/onpolicy/scripts
env is MPE, scenario is simple_spread, algo is rmappo, exp is check, max seed is 1
seed is 1:
choose to use cpu...
wandb: Currently logged in as: yi-li (use wandb login --relogin to force relogin)
wandb: Tracking run with wandb version 0.12.11
wandb: Run data is saved locally in /content/drive/MyDrive/MAPPO/on-policy-main/onpolicy/scripts/results/MPE/simple_spread/rmappo/check/wandb/run-20220303_092843-1t4tb6nn
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run rmappo_check_seed1
wandb: ⭐️ View project at https://wandb.ai/yi-li/MPE
wandb: 🚀 View run at https://wandb.ai/yi-li/MPE/runs/1t4tb6nn
Exception in thread NetStatThr:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_run.py", line 149, in check_network_status
status_response = self._interface.communicate_network_status()
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 125, in communicate_network_status
resp = self._communicate_network_status(status)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface_shared.py", line 397, in _communicate_network_status
resp = self._communicate(req, local=True)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface_shared.py", line 222, in _communicate
return self._communicate_async(rec, local=local).get(timeout=timeout)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface_shared.py", line 227, in _communicate_async
raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown

Exception in thread ChkStopThr:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_run.py", line 167, in check_status
status_response = self._interface.communicate_stop_status()
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 114, in communicate_stop_status
resp = self._communicate_stop_status(status)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface_shared.py", line 387, in _communicate_stop_status
resp = self._communicate(req, local=True)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface_shared.py", line 222, in _communicate
return self._communicate_async(rec, local=local).get(timeout=timeout)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface_shared.py", line 227, in _communicate_async
raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown

and this is my debug.log:
2022-03-03 09:28:43,805 INFO MainThread:4971 [wandb_setup.py:_flush():75] Loading settings from /root/.config/wandb/settings
2022-03-03 09:28:43,806 INFO MainThread:4971 [wandb_setup.py:_flush():75] Loading settings from /content/drive/MyDrive/MAPPO/on-policy-main/onpolicy/scripts/wandb/settings
2022-03-03 09:28:43,806 INFO MainThread:4971 [wandb_setup.py:_flush():75] Loading settings from environment variables: {}
2022-03-03 09:28:43,806 INFO MainThread:4971 [wandb_setup.py:_flush():75] Inferring run settings from compute environment: {'program_relpath': 'train/train_mpe.py', 'program': 'train/train_mpe.py'}
2022-03-03 09:28:43,808 INFO MainThread:4971 [wandb_init.py:_log_setup():405] Logging user logs to /content/drive/MyDrive/MAPPO/on-policy-main/onpolicy/scripts/results/MPE/simple_spread/rmappo/check/wandb/run-20220303_092843-1t4tb6nn/logs/debug.log
2022-03-03 09:28:43,808 INFO MainThread:4971 [wandb_init.py:_log_setup():406] Logging internal logs to /content/drive/MyDrive/MAPPO/on-policy-main/onpolicy/scripts/results/MPE/simple_spread/rmappo/check/wandb/run-20220303_092843-1t4tb6nn/logs/debug-internal.log
2022-03-03 09:28:43,809 INFO MainThread:4971 [wandb_init.py:init():439] calling init triggers
2022-03-03 09:28:43,809 INFO MainThread:4971 [wandb_init.py:init():443] wandb.init called with sweep_config: {}
config: {'algorithm_name': 'rmappo', 'experiment_name': 'check', 'seed': 1, 'cuda': True, 'cuda_deterministic': True, 'n_training_threads': 1, 'n_rollout_threads': 128, 'n_eval_rollout_threads': 1, 'n_render_rollout_threads': 1, 'num_env_steps': 20000000, 'user_name': 'yi-li', 'use_wandb': True, 'env_name': 'MPE', 'use_obs_instead_of_state': False, 'episode_length': 25, 'share_policy': True, 'use_centralized_V': True, 'stacked_frames': 1, 'use_stacked_frames': False, 'hidden_size': 64, 'layer_N': 1, 'use_ReLU': False, 'use_popart': False, 'use_valuenorm': True, 'use_feature_normalization': True, 'use_orthogonal': True, 'gain': 0.01, 'use_naive_recurrent_policy': False, 'use_recurrent_policy': True, 'recurrent_N': 1, 'data_chunk_length': 10, 'lr': 0.0007, 'critic_lr': 0.0007, 'opti_eps': 1e-05, 'weight_decay': 0, 'ppo_epoch': 10, 'use_clipped_value_loss': True, 'clip_param': 0.2, 'num_mini_batch': 1, 'entropy_coef': 0.01, 'value_loss_coef': 1, 'use_max_grad_norm': True, 'max_grad_norm': 10.0, 'use_gae': True, 'gamma': 0.99, 'gae_lambda': 0.95, 'use_proper_time_limits': False, 'use_huber_loss': True, 'use_value_active_masks': True, 'use_policy_active_masks': True, 'huber_delta': 10.0, 'use_linear_lr_decay': False, 'save_interval': 1, 'log_interval': 5, 'use_eval': False, 'eval_interval': 25, 'eval_episodes': 32, 'save_gifs': False, 'use_render': False, 'render_episodes': 5, 'ifi': 0.1, 'model_dir': None, 'scenario_name': 'simple_spread', 'num_landmarks': 3, 'num_agents': 3}
2022-03-03 09:28:43,809 INFO MainThread:4971 [wandb_init.py:init():492] starting backend
2022-03-03 09:28:43,809 INFO MainThread:4971 [backend.py:_multiprocessing_setup():101] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2022-03-03 09:28:43,831 INFO MainThread:4971 [backend.py:ensure_launched():219] starting backend process...
2022-03-03 09:28:43,840 INFO MainThread:4971 [backend.py:ensure_launched():225] started backend process with pid: 4987
2022-03-03 09:28:43,851 INFO MainThread:4971 [wandb_init.py:init():501] backend started and connected
2022-03-03 09:28:43,867 INFO MainThread:4971 [wandb_init.py:init():565] updated telemetry
2022-03-03 09:28:43,873 INFO MainThread:4971 [wandb_init.py:init():596] communicating run to backend with 30 second timeout
2022-03-03 09:28:45,203 INFO MainThread:4971 [wandb_run.py:_on_init():1759] communicating current version
2022-03-03 09:28:45,250 INFO MainThread:4971 [wandb_run.py:_on_init():1763] got version response
2022-03-03 09:28:45,251 INFO MainThread:4971 [wandb_init.py:init():625] starting run threads in backend
2022-03-03 09:28:49,989 INFO MainThread:4971 [wandb_run.py:_console_start():1733] atexit reg
2022-03-03 09:28:49,993 INFO MainThread:4971 [wandb_run.py:_redirect():1606] redirect: SettingsConsole.REDIRECT
2022-03-03 09:28:49,994 INFO MainThread:4971 [wandb_run.py:_redirect():1611] Redirecting console.
2022-03-03 09:28:50,000 INFO MainThread:4971 [wandb_run.py:_redirect():1667] Redirects installed.
2022-03-03 09:28:50,001 INFO MainThread:4971 [wandb_init.py:init():664] run started, returning control to user process

Cannot reproduce the MPE results

Hi,
Thanks for your contribution to the community. I ran the given MPE (simple_spread) example directly with the default parameters, except that I changed 'n_rollout_threads' to 8 due to compute limits. However, the result on W&B only reaches around -170, which does not seem to be working.

Has anybody met such a problem? Could you help me fix it?

Thanks!

cannot reproduce the performance of MPE

Hi, I have an issue when reproducing the performance of simple_spread in MPE.

The only modifications to your code:

  1. use --use_wandb to disable wandb in train_mpe.sh
  2. add self.envs.reset() before line 26 in mpe_runner.py

Values of Hyperparameters

Could you provide the values of "--use_popart", "--use_valuenorm", "--use_value_active_masks", "--use_policy_active_masks" across all SMAC maps to help better reproduce your results? Thanks a lot!

Error in setting `bad_transition`

elif self._episode_steps >= self.episode_limit:
    # Episode limit reached
    terminated = True
    self.bad_transition = True
    if self.continuing_episode:
        info["episode_limit"] = True
    self.battles_game += 1
    self.timeouts += 1

for i in range(self.n_agents):
    infos[i] = {
        "battles_won": self.battles_won,
        "battles_game": self.battles_game,
        "battles_draw": self.timeouts,
        "restarts": self.force_restarts,
        "bad_transition": bad_transition,
        "won": self.win_counted
    }

It should be bad_transition = True in line 563.

Differences between the starcraft environment and environment in SMAC

Hi authors,

Really appreciate this nice codebase for MAPPO. One issue I found is that the environment used in MAPPO is somewhat different from the environment in SMAC. For example, compare how this environment and the StarCraft environment in SMAC implement the global_state function:

https://github.com/oxwhirl/smac/blob/a54ebb937e44dc9c703d96064f423069532c7b66/smac/env/starcraft2/starcraft2.py#L1135

def get_state(self, agent_id=-1):

The get_state function in SMAC returns the real global state, while the get_state function in your StarCraft environment still returns an agent-dependent state. I am wondering why these two environments are different.

An error occurs when I run rmappo on football

The output of python is listed here:

Traceback (most recent call last):
File "train/train_football.py", line 203, in
main(sys.argv[1:])
File "train/train_football.py", line 188, in main
runner.run()
File "/onpolicy/runner/shared/football_runner.py", line 43, in run
self.insert(data)
File "/onpolicy/runner/shared/football_runner.py", line 141, in insert
masks=masks
TypeError: insert() got an unexpected keyword argument 'rnn_states'

envs reset in data-collecting and evaluation period

In both the data-collection and evaluation periods, when an episode terminates, the model just takes the last obs of the corresponding env as input. But I think the envs should be reset when they reach termination. Or have I missed something in the code?

`eval_average_episode_rewards` is not defined in shared/mpe_runner.py

This is a bug that I fixed by looking at the other onpolicy code you have. In shared/mpe_runner.py, in def eval(), almost at the end of the function:

        eval_episode_rewards = np.array(eval_episode_rewards)
        eval_env_infos = {}
        eval_env_infos['eval_average_episode_rewards'] = np.sum(np.array(eval_episode_rewards), axis=0)
        # print("eval average episode rewards of agent: " + str(eval_average_episode_rewards))

eval_average_episode_rewards is not defined, and the code will exit with an error. Instead, I used:

print("eval average episode rewards of agent: " + str(np.mean(eval_env_infos['eval_average_episode_rewards'])))

This is different than in separated/mpe_runner.py:

        eval_train_infos = []
        for agent_id in range(self.num_agents):
            eval_average_episode_rewards = np.mean(np.sum(eval_episode_rewards[:, :, agent_id], axis=0))
            eval_train_infos.append({'eval_average_episode_rewards': eval_average_episode_rewards})
            print("eval average episode rewards of agent%i: " % agent_id + str(eval_average_episode_rewards))

but I guess the logic is that in the shared case agent1 and agent2 are the same, so averaging the reward across their performance is reasonable.

GPU error

Hi

The code works in cpu but I get the below problem with gpu:

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

Full details are below:

env is Hanabi, algo is mappo, exp is mlp_critic1e-3_entropy0.015_v0belief, max seed is 1
seed is 1:
choose to use gpu...
Traceback (most recent call last):
  File "train/train_hanabi_forward.py", line 176, in <module>
    main(sys.argv[1:])
  File "train/train_hanabi_forward.py", line 161, in main
    runner.run()
  File "/nfs/home/ic/on-policy/onpolicy/runner/shared/hanabi_runner_forward.py", line 50, in run
    self.collect(step) 
  File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/nfs/home/ic/on-policy/onpolicy/runner/shared/hanabi_runner_forward.py", line 153, in collect
    self.use_available_actions[choose])
  File "/nfs/home/ic/on-policy/onpolicy/algorithms/r_mappo/algorithm/rMAPPOPolicy.py", line 71, in get_actions
    deterministic)
  File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/nfs/home/ic/on-policy/onpolicy/algorithms/r_mappo/algorithm/r_actor_critic.py", line 62, in forward
    actor_features = self.base(obs)
  File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/nfs/home/ic/on-policy/onpolicy/algorithms/utils/mlp.py", line 54, in forward
    x = self.mlp(x)
  File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/nfs/home/ic/on-policy/onpolicy/algorithms/utils/mlp.py", line 25, in forward
    x = self.fc1(x)
  File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/functional.py", line 1610, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
training is done!

I get the error with both cuda 10.1 and 11.2. Do you have any thoughts on how to fix this? Many thanks!

QMIX performance in Hanabi

I read the MAPPO paper and learned interesting insights from it. I have a few questions regarding the paper (and its previous version).

I saw the QMIX performance in Hanabi reported in a previous version of the paper, which was very low (0.29 in the small version). I'm curious why it performed so badly. The results reported in the current version with VDN are much better, despite QMIX being a superset of VDN. I understand that the current paper uses the full version of the game; still, its score almost reaches the optimum, while QMIX (on the small version) doesn't seem to learn a meaningful policy. Is it possible to get similar performance to VDN using QMIX?

wandb error

I met the error: wandb: ERROR Error while calling W&B API: project not found (<Response [404]>)

I have changed --user_name in config.py to mine, but it still doesn't work.

Thank you for your reply.

permission denied

When reproducing the project, the last step, ./train_mpe.sh, errors out when requesting link data from the project owner (permission denied). How can I solve this?

TypeError: cannot assign 'torch.FloatTensor' as parameter 'stddev' (torch.nn.Parameter or None expected)

Running train_mpe

The code reports an error

Line 63 in the file "onpolicy/algorithms/utils/popart.py"

    self.stddev = (self.mean_sq - self.mean ** 2).sqrt().clamp(min=1e-4)
    self.weight = self.weight * old_stddev / self.stddev
    self.bias = (old_stddev * self.bias + old_mean - self.mean) / self.stddev

File "/on-policy-main/onpolicy/algorithms/utils/popart.py", line 63, in update
self.stddev = (self.mean_sq - self.mean ** 2).sqrt().clamp(min=1e-4)
File "/home/cc/anaconda3/envs/MAPPO/lib/python3.6/site-packages/torch/nn/modules/module.py", line 613, in setattr
.format(torch.typename(value), name))
TypeError: cannot assign 'torch.FloatTensor' as parameter 'stddev' (torch.nn.Parameter or None expected)

Does the above code need to be changed to the following?

    self.stddev = nn.Parameter((self.mean_sq - self.mean ** 2).sqrt().clamp(min=1e-4))

    self.weight = nn.Parameter(self.weight * old_stddev / self.stddev)
    self.bias = nn.Parameter((old_stddev * self.bias + old_mean - self.mean) / self.stddev)

Is the separated runner fully supported?

Hi @jxwuyi @eugenevinitsky @zoeyuchao @akashvelu

Thanks a lot for your great work! It helps a lot in carrying the performance of on-policy DRL algorithms over from single-agent to multi-agent settings.

I found that some files for the separated runner on specific environments (e.g. SMAC or MPE) are missing from the folder "onpolicy/runner/seperated/xxx". But I noticed that a base_runner is implemented. So I was wondering whether the separated runner (perhaps plus the separated buffer) is a ready-to-go implementation with PPO optimization. If so, I could use it directly with my customized environment and run experiments, similarly to the shared runner.

Look forward to your reply! Thanks!

Best,
Xubo

About QMix(MG)

Hello! I found that in the new version of the MAPPO paper, you use a concatenation of the default environment global state and all agents' local observations as the mixer network input. But why don't you instead concatenate the feature-pruned agent-specific global states, which MAPPO uses, to build the input of the mixer network? Is this unfair for the comparison?

training MAPPO and RMAPPO with MPE gives an error!

Hello,
Thanks for the code base; it is indeed really good work!
I was trying to replicate the results from the paper and started with the MPE environment using mappo and rmappo.
But when I train rmappo with MPE, I get the following error:

env is MPE, scenario is simple_spread, algo is mappo, exp is check, max seed is 1
seed is 1:
choose to use gpu...
Traceback (most recent call last):
  File "train/train_mpe.py", line 167, in <module>
    main(sys.argv[1:])
  File "train/train_mpe.py", line 152, in main
    runner.run()
  File "/home/kailash/Desktop/on-policy/onpolicy/runner/shared/mpe_runner.py", line 47, in run
    self.save()
  File "/home/kailash/Desktop/on-policy/onpolicy/runner/shared/base_runner.py", line 134, in save
    policy_vnorm = self.trainer.policy.value_normalizer
AttributeError: 'R_MAPPOPolicy' object has no attribute 'value_normalizer'

When I train mappo with MPE, I get the following error:

env is MPE, scenario is simple_spread, algo is mappo, exp is check, max seed is 1
seed is 1:
Traceback (most recent call last):
  File "train/train_mpe.py", line 167, in <module>
    main(sys.argv[1:])
  File "train/train_mpe.py", line 71, in main
    assert (all_args.use_recurrent_policy == False and all_args.use_naive_recurrent_policy == False), ("check recurrent policy!")
AssertionError: check recurrent policy!

Please let me know if I made any mistake. Thanks for the help!

Evaluation ONLY mode in MAPE environment

From what I understand, in MAPE evaluation is "entangled" with training (the code alternates between training and evaluation phases). Is there a way (or a script) to evaluate only a trained agent and save gifs/videos, etc.?

use_centralized_V

Hi, dear author.
I find that each agent uses an independent V value function (with an agent-specific state) in the code. Why is it called 'use_centralized_V'?
In my opinion, 'use_centralized_V' should mean that all agents share the same V value function.
Thanks!

code error about value norm

In r_mappo.py, line 175

        if self._use_popart:
            advantages = buffer.returns[:-1] - self.value_normalizer.denormalize(buffer.value_preds[:-1])
        else:
            advantages = buffer.returns[:-1] - buffer.value_preds[:-1]

should be fixed to

        if self._use_popart or self._use_valuenorm:
            advantages = buffer.returns[:-1] - self.value_normalizer.denormalize(buffer.value_preds[:-1])
        else:
            advantages = buffer.returns[:-1] - buffer.value_preds[:-1]

This error collapses the performance of PPO.

Others

It seems that the paper mentions the use of PopArt, but it is not actually used in the code.
Finally, MAPPO (centralized V) is mentioned in the abstract of the paper, but it is actually IPPO with global information, because the value function is not centralized.

Thanks.
