
MAAC's Introduction

Multi-Actor-Attention-Critic

Code for Actor-Attention-Critic for Multi-Agent Reinforcement Learning (Iqbal and Sha, ICML 2019)

Requirements

The versions listed in the repository's requirements are just what I used and are not necessarily strict requirements.

How to Run

All training code is contained within main.py. To view the options, simply run:

python main.py --help

The "Cooperative Treasure Collection" environment from our paper is referred to as fullobs_collect_treasure in this repo, and "Rover-Tower" is referred to as multi_speaker_listener.

In order to match our experiments, the maximum episode length should be set to 100 for Cooperative Treasure Collection and 25 for Rover-Tower.
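For example, a command along these lines should reproduce the Cooperative Treasure Collection setup (the exact flag name for the episode length is an assumption here; check python main.py --help for the option your version exposes):

python main.py fullobs_collect_treasure my_run_1 --episode_length 100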

Citing our work

If you use this repo in your work, please consider citing the corresponding paper:

@InProceedings{pmlr-v97-iqbal19a,
  title =    {Actor-Attention-Critic for Multi-Agent Reinforcement Learning},
  author =   {Iqbal, Shariq and Sha, Fei},
  booktitle =    {Proceedings of the 36th International Conference on Machine Learning},
  pages =    {2961--2970},
  year =     {2019},
  editor =   {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume =   {97},
  series =   {Proceedings of Machine Learning Research},
  address =      {Long Beach, California, USA},
  month =    {09--15 Jun},
  publisher =    {PMLR},
  pdf =      {http://proceedings.mlr.press/v97/iqbal19a/iqbal19a.pdf},
  url =      {http://proceedings.mlr.press/v97/iqbal19a.html},
}

MAAC's People

Contributors

hassamsheikh, shariqiqbal2810


MAAC's Issues

How to visualize the result?

Hi, I am wondering whether you have any visualization of the results, so that the effectiveness of the algorithm on the two new envs can be seen directly?
I noticed that in baselines.common.vec_env, the VecEnv class does not define a render function; is there any way I can do the visualization?
Thank you so much!

connection.py

Dear Shariq,

I am trying to run this code and have already installed all of the requirements, but this is the error I am facing:

    python3.7/multiprocessing/connection.py", line 383, in _recv
        raise EOFError

I was wondering if you could let me know how I can solve it.

Bests,
Azadeh

About SAC implementation

Hi, in your implementation SAC is used, but V is estimated from the Q-function when updating the critic and calculating the target Q, instead of using a separate value network as in the original SAC paper. Could you please explain this or give some references? Thanks.
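For context, a minimal sketch of the idea being asked about: later SAC variants drop the separate value network and estimate the soft value directly from the target Q-function and the current policy. The policy.sample and target_q interfaces below are assumptions for illustration, not the repo's actual classes.

    import torch

    def soft_value_estimate(target_q, policy, next_obs, alpha=0.01):
        # One-sample estimate of the soft value used in place of a separate V network:
        #   V(s') ~= Q_target(s', a') - alpha * log pi(a'|s'),  with a' ~ pi(.|s')
        # `policy.sample` and `target_q` are assumed interfaces for this sketch.
        with torch.no_grad():
            next_action, next_log_pi = policy.sample(next_obs)
            return target_q(next_obs, next_action) - alpha * next_log_pi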

when I run"python main.py fullobs_collect_treasure dir_1",there are the errors

    Episodes 1-13 of 50000
    Process Process-1:
    Traceback (most recent call last):
      File "/home/gezhixin/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
        self.run()
      File "/home/gezhixin/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
        self._target(*self._args, **self._kwargs)
      File "/home/gezhixin/MAAC-master/utils/env_wrappers.py", line 20, in worker
        ob = env.reset()
      File "/home/gezhixin/anaconda3/lib/python3.6/site-packages/gym/core.py", line 66, in reset
        raise NotImplementedError
    NotImplementedError
    Process Process-2:
    (identical traceback repeated)
    Process Process-3:

VecEnv: why close the Pipe()?

Hi, in the code

class SubprocVecEnv(VecEnv):
    def __init__(self, env_fns, spaces=None):
        """
        envs: list of gym environments to run in subprocesses
        """
        self.waiting = False
        self.closed = False
        nenvs = len(env_fns)
        self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
        self.ps = [Process(target=worker, args=(work_remote, remote, CloudpickleWrapper(env_fn)))
            for (work_remote, remote, env_fn) in zip(self.work_remotes, self.remotes, env_fns)]
        for p in self.ps:
            p.daemon = True # if the main process crashes, we should not cause things to hang
            p.start()
        for remote in self.work_remotes:
            remote.close()

Why

        for remote in self.work_remotes:
            remote.close()

If the remote pipe is closed, how can messages still be sent?
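For reference, a minimal, self-contained sketch of the standard multiprocessing pattern in question (generic Python behavior, not code from this repo): each end of a Pipe() is duplicated into the child process at start(), so the parent closing its copy of the worker end does not close the child's copy; it only ensures EOF is raised once the child exits.

    from multiprocessing import Pipe, Process

    def worker(conn):
        # The child received its own duplicated handle of this connection at start(),
        # so it still works after the parent closes its copy of the same end.
        conn.send("hello from child")
        conn.close()

    if __name__ == "__main__":
        parent_end, child_end = Pipe()
        p = Process(target=worker, args=(child_end,))
        p.start()
        child_end.close()          # parent drops only ITS copy of the child end
        print(parent_end.recv())   # still prints "hello from child"
        p.join()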

About the reproduction of the Cooperative Treasure Collection experiment

Thanks for sharing the source code of MAAC. This is a very interesting paper. When I reproduce the experiments, the results for Cooperative Treasure Collection are quite different from the paper's. The episode_length parameter is 100, and the source code reports statistics averaged over the last 100 steps for each agent.

    def get_average_rewards(self, N):
        if self.filled_i == self.max_steps:
            inds = np.arange(self.curr_i - N, self.curr_i)  # allow for negative indexing
        else:
            inds = np.arange(max(0, self.curr_i - N), self.curr_i)
        return [self.rew_buffs[i][inds].sum() for i in range(self.num_agents)]

Therefore, I sum the values across agents to get the result shown in the figure.

[training curve screenshot]

So, I want to know how the results in the original paper were calculated!
Hoping for your reply!

Wei Zhou,
[email protected]

2 agents modification

I noticed that the code crashes when 2 agents are used, since there is a dimension problem with the sum function in critics.py line 138.

I managed to sort it out in this way:

for i, a_i in enumerate(agents):
    if max(agents) == 1:
        head_entropies = [(-((probs + 1e-8).log() * probs).squeeze().sum(0)
                           .mean()) for probs in all_attend_probs[i]]
    else:
        head_entropies = [(-((probs + 1e-8).log() * probs).squeeze().sum(1)
                           .mean()) for probs in all_attend_probs[i]]

Does this look right to you?
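A possibly simpler variant (untested, just a sketch, assuming each entry of all_attend_probs[i] has shape (batch, 1, n_other_agents)): squeeze only the singleton middle dimension, so the same expression handles both the 2-agent and the general case.

    # squeeze only dim 1, so this works for 2 agents as well as more
    head_entropies = [(-((probs + 1e-8).log() * probs).squeeze(1).sum(1)
                       .mean()) for probs in all_attend_probs[i]]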

When I run "python main.py fullobs_collect_treasure V1" I meet error "ImportError: cannot import name 'Wall'"

The error log is as follows:
File "", line 678, in exec_module
File "", line 219, in _call_with_frames_removed
File "D:\project_code\python\yiwei\MAAC-master\envs\mpe_scenarios\fullobs_collect_treasure.py", line 3, in
from multiagent.core import World, Agent, Landmark, Wall
ImportError: cannot import name 'Wall'

It seems that there is no "Wall" class in the multiagent repo. I searched the core.py file in the multiagent repo https://github.com/openai/multiagent-particle-envs and found World, Agent, and Landmark, but there is no Wall class. Did you change your multiagent.core file? What is the definition of the Wall class? Thanks.

Memory Leak

Hi,

When I run the code in the fullobs_collect_treasure domain on CPU only, I noticed a memory leak happening inside the model.update_critic and model.update_policies functions. Even after the buffer is fully filled, memory usage keeps going up and eventually exhausts my memory. I don't know which line of the code leads to this problem.

Has anyone else run into this issue? Thank you!

State Action Encoding in Critic

I was going through your code and I am having a difficult time understanding one part of the critic. If you look at the line

critic_in = torch.cat((s_encodings[i], *other_all_values[i]), dim=1)

you are just using a state encoding along with the joint embedding of the state-action pairs of all other agents as input to the Q-function. If I recall correctly, equation 5 takes an embedding of the current agent's state-action pair alongside the joint embedding. Can you please explain what is going on here?

question about reward

Why is the average reward reported in the paper much higher than what the code produces? It's ~6 after training, but the paper reports 125. Did you change the reward in the environment?
Also, at the end of each episode, for example in the multi_speaker_listener environment, the listener cannot reach its target position. Is this the same as in your results?

About query, key and value input embedding

In the code:
the input of sel_ext (query) is state_encodings
the input of k_ext (key) is state_action_encodings
the input of v_ext (value) is state_action_encodings

In the paper, the inputs of both the key and the query should be state_action_encodings.

I think the correct inputs should be:
the input of sel_ext (query) is state_action_encodings (changed)
the input of k_ext (key) is state_action_encodings
the input of v_ext (value) is state_encodings (changed)

Could you explain why this is done in the code?
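For reference, a self-contained toy restating the wiring described above (names, shapes, and layer choices are illustrative assumptions, not the repo's exact code): the query comes from a state-only encoding, while keys and values come from state-action encodings.

    import torch
    import torch.nn as nn

    hidden_dim, n_agents, batch = 32, 3, 4
    sel_ext = nn.Linear(hidden_dim, hidden_dim, bias=False)   # query extractor
    k_ext = nn.Linear(hidden_dim, hidden_dim, bias=False)     # key extractor
    v_ext = nn.Linear(hidden_dim, hidden_dim)                 # value extractor

    s_encodings = [torch.randn(batch, hidden_dim) for _ in range(n_agents)]   # g(o_j)
    sa_encodings = [torch.randn(batch, hidden_dim) for _ in range(n_agents)]  # f(o_j, a_j)

    query = sel_ext(s_encodings[0])                # agent 0's query: state encoding only
    keys = [k_ext(e) for e in sa_encodings[1:]]    # other agents' keys: state-action encodings
    values = [v_ext(e) for e in sa_encodings[1:]]  # other agents' values: state-action encodings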

About the results

Hello, thank you very much for open-sourcing the code for this paper. This is very good work!

When running this code in the Cooperative Treasure Collection multi-agent environment, my results are as follows:

[training curve screenshot]

These results are quite different from the average reward in the paper, which is about 100, and I have not changed any parameters. Is there anything special about how the average reward is calculated?

How to evaluate and test the training result?

Hi,
Your code is written very well~ Thanks for your great work! But across all the folders I only found code for training the model; I didn't find code for testing or evaluating the trained model in the environments. Would you mind uploading an evaluation script for testing or evaluating?
Thank you~

How to run your code in other scenarios, e.g., cooperative navigation

Hi, thanks for your great work! I ran into a problem when running your code in other scenarios, e.g., simple_spread. The command I used is "python main.py simple_spread dataxx --use_gpu". Is it caused by the gym version? I have tested both gym 0.9.4 and 0.12.5, but the following problem still exists. Could you please give me some advice on the problem? Thanks very much!

[error screenshots]

Seeding fails to produce deterministic results

Hi, thanks for this great code. I've been using it for some experiments and have been having some issues with replicability. One thing I notice is that learning curves are different even when I try and set the same random seed. I still get different results even if I do:

    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # if you are using multi-GPU.
    np.random.seed(seed)  # Numpy module.
    random.seed(seed)  # Python random module.
    torch.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

pytorch/pytorch#7068

Have you noticed this issue and is there a way to resolve it?

Also, a quick unrelated question: for Treasure Collection I notice that by default you use substantial reward shaping. Was this shaping used for the Actor-Attention-Critic paper? Were you able to successfully train without it?

cannot handle some scenarios

Hello, I tried to run your code on simple_reference and simple_world_comm. However, the code reported an error. Can MAAC handle scenarios with both physical actions and communication actions?

issue thanks!

Asking for advice: for inps (list of PyTorch matrices), described as "Inputs to each agents' encoder (batch of obs + ac)", how should this parameter be set? I tried:

    A = np.mat(obs1)
    B = np.mat(action)
    inps = list((A, B))

which leads to:

    states = [s for s, a in inps]
    ValueError: too many values to unpack (expected 2)

Looking forward to your reply!
o_dim = 12
a_dim = 6
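From the quoted line states = [s for s, a in inps], the expected structure appears to be one (observation, action) pair per agent, each a batched tensor. A hedged, self-contained sketch of that format (shapes are illustrative, reusing the o_dim and a_dim above):

    import torch

    o_dim, a_dim, batch_size, n_agents = 12, 6, 32, 2
    inps = [(torch.randn(batch_size, o_dim),    # agent i's batch of observations
             torch.randn(batch_size, a_dim))    # agent i's batch of (one-hot) actions
            for _ in range(n_agents)]
    states = [s for s, a in inps]               # unpacks cleanly: one (obs, acs) tuple per agent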

Critic encoders as shared modules?

Dear Shariq,

In your article "Actor-Attention-Critic for Multi-Agent Reinforcement Learning", figure 1., there is a MLP, which is said to be unique per agent according to the legend (blue background color), that realize a state-action encoding from (o_i, a_i) to e_i.

I believed this corresponds to what is named "critic_encoders" in your code. However, these encoders are listed in the shared_modules here:

self.value_extractors, self.critic_encoders]

Is it normal ? From my understanding, the consequence from being listed in shared_modules is that the gradients are scaled by (1/agents_numbers), so I believe this would have only minor consequences on the behavior of the algorithm.

Best regards

How does the gradient back-propagate from Q to the action $a_i$?

I wonder how the gradient back-propagates from Q to $a_i$.
Tracing from Q:

MAAC/utils/critics.py, lines 149 to 150 (at 105d60e):

    all_q = self.critics[a_i](critic_in)
    int_acs = actions[a_i].max(dim=1, keepdim=True)[1]

Then tracing critic_in:

    critic_in = torch.cat((s_encodings[i], *other_all_values[i]), dim=1)

Since s_encoding doesn't contain input from $a_i$, I then trace other_all_values[i]:

MAAC/utils/critics.py, lines 125 to 141 (at 105d60e):

    for curr_head_keys, curr_head_values, curr_head_selectors in zip(
            all_head_keys, all_head_values, all_head_selectors):
        # iterate over agents
        for i, a_i, selector in zip(range(len(agents)), agents, curr_head_selectors):
            keys = [k for j, k in enumerate(curr_head_keys) if j != a_i]
            values = [v for j, v in enumerate(curr_head_values) if j != a_i]
            # calculate attention across agents
            attend_logits = torch.matmul(selector.view(selector.shape[0], 1, -1),
                                         torch.stack(keys).permute(1, 2, 0))
            # scale dot-products by size of key (from Attention is All You Need)
            scaled_attend_logits = attend_logits / np.sqrt(keys[0].shape[1])
            attend_weights = F.softmax(scaled_attend_logits, dim=2)
            other_values = (torch.stack(values).permute(1, 2, 0) *
                            attend_weights).sum(dim=2)
            other_all_values[i].append(other_values)
            all_attend_logits[i].append(attend_logits)
            all_attend_probs[i].append(attend_weights)

keys and values don't contain agent i's action as input, and selector uses only observations as input:

MAAC/utils/critics.py, lines 118 to 119 (at 105d60e):

    all_head_selectors = [[sel_ext(enc) for i, enc in enumerate(s_encodings) if i in agents]
                          for sel_ext in self.selector_extractors]

So, is there a gradient from Q to action $a_i$?

About the code

Hi,
I'm very sorry to trouble you. I am reading your MAAC paper and I see that the results are compared with COMA, so I want to ask whether you could also release your COMA code.

Thanks!

Does training an advantage soft actor-critic from a replay buffer have large bias?

Dear Author,

I took a quick look at your code for the actor updates. It seems that you use an advantage soft actor-critic, i.e.,

    Advantage: pol_target = q - v
    Loss:      pol_loss = (log_pi * (log_pi / self.reward_scale - pol_target).detach()).mean()

If you use the above updates, I think it is an on-policy soft A2C, so an unbiased actor should only be updated from freshly collected data rather than from the replay buffer. Otherwise, it will be a biased estimate with respect to the current policy. Right?

Best,
Hui

Bias on value extractors?

Dear Shariq,

In your article, no bias is used to calculate the x_i. However, in the code the bias is not set to False for the value extractors, and I believe its default value is True:

    self.value_extractors.append(nn.Sequential(nn.Linear(hidden_dim,

Is there a reason for that? Thank you.

About environment

This unwrapped multiagent environment has an abstract reset() method.
The worker() method in make_env.py may be modified as follows:
[screenshot of the suggested modification failed to upload]
Errors occur without these modifications on my Python 3.5. What about yours?

multi-agent particle environments

When I run your multi-agent particle environments, I get this error:

    Traceback (most recent call last):
      File "/home/cherry/multiagent-particle-envs-master/bin/interactive.py", line 26, in
        env.render()
      File "/home/cherry/anaconda3/envs/shyang/lib/python3.6/site-packages/gym/core.py", line 108, in render
        raise NotImplementedError
    NotImplementedError

Critic function learning

Hi Shariq,
In your implementation and the MAAC paper, you use expected discounted returns to learn the state-action Q function, e.g., Eq. (2) and (7), instead of the maximum of Q(s, a) over actions a. Could you explain this or give a reference?
Best,
Yesiam
