
MAAC's Introduction

Multi-Actor-Attention-Critic

Code for Actor-Attention-Critic for Multi-Agent Reinforcement Learning (Iqbal and Sha, ICML 2019)

Requirements

The versions listed in the repository's requirements are just what I used and are not necessarily strict requirements.

How to Run

All training code is contained within main.py. To view the options, simply run:

python main.py --help

The "Cooperative Treasure Collection" environment from our paper is referred to as fullobs_collect_treasure in this repo, and "Rover-Tower" is referred to as multi_speaker_listener.

In order to match our experiments, the maximum episode length should be set to 100 for Cooperative Treasure Collection and 25 for Rover-Tower.
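For example, a command along these lines should reproduce the Cooperative Treasure Collection setup (the exact flag name for the episode length is an assumption here; check python main.py --help for the option your version exposes):

python main.py fullobs_collect_treasure my_run_1 --episode_length 100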

Citing our work

If you use this repo in your work, please consider citing the corresponding paper:

@InProceedings{pmlr-v97-iqbal19a,
  title =    {Actor-Attention-Critic for Multi-Agent Reinforcement Learning},
  author =   {Iqbal, Shariq and Sha, Fei},
  booktitle =    {Proceedings of the 36th International Conference on Machine Learning},
  pages =    {2961--2970},
  year =     {2019},
  editor =   {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume =   {97},
  series =   {Proceedings of Machine Learning Research},
  address =      {Long Beach, California, USA},
  month =    {09--15 Jun},
  publisher =    {PMLR},
  pdf =      {http://proceedings.mlr.press/v97/iqbal19a/iqbal19a.pdf},
  url =      {http://proceedings.mlr.press/v97/iqbal19a.html},
}

MAAC's People

Contributors

hassamsheikh, shariqiqbal2810


MAAC's Issues

How to visualize the result?

Hi, I am wondering whether you have any visualization of the results, so that the effectiveness of the algorithm on the two new envs can be seen directly?
I noticed that in baselines.common.vec_env, the VecEnv class does not define a render function; is there any way I can do the visualization?
Thank you so much!

connection.py

Dear Shariq,

I am trying to run this code and have already installed all of the requirements, but this is the error I am facing:

    python3.7/multiprocessing/connection.py", line 383, in _recv
        raise EOFError

I was wondering if you could let me know how I can solve it.

Bests,
Azadeh

About SAC implementation

Hi, in your implementation SAC is used, but V is estimated from the Q-function when updating the critic and calculating the target Q, instead of using a separate value network as in the original SAC paper. Could you please explain this or give some references? Thanks.
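For context, a minimal sketch of the idea being asked about: later SAC variants drop the separate value network and estimate the soft value directly from the target Q-function and the current policy. The policy.sample and target_q interfaces below are assumptions for illustration, not the repo's actual classes.

    import torch

    def soft_value_estimate(target_q, policy, next_obs, alpha=0.01):
        # One-sample estimate of the soft value used in place of a separate V network:
        #   V(s') ~= Q_target(s', a') - alpha * log pi(a'|s'),  with a' ~ pi(.|s')
        # `policy.sample` and `target_q` are assumed interfaces for this sketch.
        with torch.no_grad():
            next_action, next_log_pi = policy.sample(next_obs)
            return target_q(next_obs, next_action) - alpha * next_log_pi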

when I run"python main.py fullobs_collect_treasure dir_1",there are the errors

    Episodes 1-13 of 50000
    Process Process-1:
    Traceback (most recent call last):
      File "/home/gezhixin/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
        self.run()
      File "/home/gezhixin/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
        self._target(*self._args, **self._kwargs)
      File "/home/gezhixin/MAAC-master/utils/env_wrappers.py", line 20, in worker
        ob = env.reset()
      File "/home/gezhixin/anaconda3/lib/python3.6/site-packages/gym/core.py", line 66, in reset
        raise NotImplementedError
    NotImplementedError
    Process Process-2:
    (identical traceback repeated)
    Process Process-3:

VecEnv: why close the Pipe()?

Hi, in the code

class SubprocVecEnv(VecEnv):
    def __init__(self, env_fns, spaces=None):
        """
        envs: list of gym environments to run in subprocesses
        """
        self.waiting = False
        self.closed = False
        nenvs = len(env_fns)
        self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
        self.ps = [Process(target=worker, args=(work_remote, remote, CloudpickleWrapper(env_fn)))
            for (work_remote, remote, env_fn) in zip(self.work_remotes, self.remotes, env_fns)]
        for p in self.ps:
            p.daemon = True # if the main process crashes, we should not cause things to hang
            p.start()
        for remote in self.work_remotes:
            remote.close()

Why

        for remote in self.work_remotes:
            remote.close()

If the remote pipe is closed, how can messages still be sent?
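For reference, a minimal, self-contained sketch of the standard multiprocessing pattern in question (generic Python behavior, not code from this repo): each end of a Pipe() is duplicated into the child process at start(), so the parent closing its copy of the worker end does not close the child's copy; it only ensures EOF is raised once the child exits.

    from multiprocessing import Pipe, Process

    def worker(conn):
        # The child received its own duplicated handle of this connection at start(),
        # so it still works after the parent closes its copy of the same end.
        conn.send("hello from child")
        conn.close()

    if __name__ == "__main__":
        parent_end, child_end = Pipe()
        p = Process(target=worker, args=(child_end,))
        p.start()
        child_end.close()          # parent drops only ITS copy of the child end
        print(parent_end.recv())   # still prints "hello from child"
        p.join()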

About the reproduction of the Cooperative Treasure Collection experiment

Thanks for sharing the source code of MAAC. This is a very interesting paper. When I reproduce the experiments, the results for Cooperative Treasure Collection are quite different from the paper's. The episode_length parameter is 100, and the source code reports statistics averaged over the last 100 steps for each agent.

    def get_average_rewards(self, N):
        if self.filled_i == self.max_steps:
            inds = np.arange(self.curr_i - N, self.curr_i)  # allow for negative indexing
        else:
            inds = np.arange(max(0, self.curr_i - N), self.curr_i)
        return [self.rew_buffs[i][inds].sum() for i in range(self.num_agents)]

Therefore, I sum the values across agents to get the result shown in the figure.

[training curve screenshot]

So, I want to know how the results in the original paper were calculated!
Hoping for your reply!

Wei Zhou,
[email protected]

2 agents modification

I noticed that the code crashes when 2 agents are used, since there is a dimension problem with the sum function in critics.py line 138.

I managed to sort it out in this way:

for i, a_i in enumerate(agents):
    if max(agents) == 1:
        head_entropies = [(-((probs + 1e-8).log() * probs).squeeze().sum(0)
                           .mean()) for probs in all_attend_probs[i]]
    else:
        head_entropies = [(-((probs + 1e-8).log() * probs).squeeze().sum(1)
                           .mean()) for probs in all_attend_probs[i]]

Does this look right to you?
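A possibly simpler variant (untested, just a sketch, assuming each entry of all_attend_probs[i] has shape (batch, 1, n_other_agents)): squeeze only the singleton middle dimension, so the same expression handles both the 2-agent and the general case.

    # squeeze only dim 1, so this works for 2 agents as well as more
    head_entropies = [(-((probs + 1e-8).log() * probs).squeeze(1).sum(1)
                       .mean()) for probs in all_attend_probs[i]]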

When I run "python main.py fullobs_collect_treasure V1" I meet error "ImportError: cannot import name 'Wall'"

The error log is as follows:
File "", line 678, in exec_module
File "", line 219, in _call_with_frames_removed
File "D:\project_code\python\yiwei\MAAC-master\envs\mpe_scenarios\fullobs_collect_treasure.py", line 3, in
from multiagent.core import World, Agent, Landmark, Wall
ImportError: cannot import name 'Wall'

It seems that there is no "Wall" class in the multiagent repo. I searched the core.py file in the multiagent repo https://github.com/openai/multiagent-particle-envs and found World, Agent, and Landmark, but there is no Wall class. Did you change your multiagent.core file? What is the definition of the Wall class? Thanks.

Memory Leak

Hi,

When I run the code in the fullobs_collect_treasure domain on CPU only, I noticed a memory leak happening inside the model.update_critic and model.update_policies functions. Even after the buffer is fully filled, memory usage keeps going up and eventually exhausts my memory. I don't know which line of the code leads to this problem.

Has anyone else run into this issue? Thank you!

State Action Encoding in Critic

I was going through your code and I am having a difficult time understanding one part of the critic. If you look at the line

critic_in = torch.cat((s_encodings[i], *other_all_values[i]), dim=1)

you are just using a state encoding along with the joint embedding of the state-action pairs of all other agents as input to the Q-function. If I recall correctly, equation 5 takes an embedding of the current agent's state-action pair alongside the joint embedding. Can you please explain what is going on here?

question about reward

Why is the average reward reported in the paper much higher than what the code produces? It's ~6 after training, but the paper reports 125. Did you change the reward in the environment?
Also, at the end of each episode, for example in the multi_speaker_listener environment, the listener cannot reach its target position. Is this the same as in your results?

About query, key and value input embedding

In the code:
the input of sel_ext (query) is state_encodings
the input of k_ext (key) is state_action_encodings
the input of v_ext (value) is state_action_encodings

In the paper, the inputs of both the key and the query should be state_action_encodings.

I think the correct inputs should be:
the input of sel_ext (query) is state_action_encodings (changed)
the input of k_ext (key) is state_action_encodings
the input of v_ext (value) is state_encodings (changed)

Could you explain why this is done in the code?
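For reference, a self-contained toy restating the wiring described above (names, shapes, and layer choices are illustrative assumptions, not the repo's exact code): the query comes from a state-only encoding, while keys and values come from state-action encodings.

    import torch
    import torch.nn as nn

    hidden_dim, n_agents, batch = 32, 3, 4
    sel_ext = nn.Linear(hidden_dim, hidden_dim, bias=False)   # query extractor
    k_ext = nn.Linear(hidden_dim, hidden_dim, bias=False)     # key extractor
    v_ext = nn.Linear(hidden_dim, hidden_dim)                 # value extractor

    s_encodings = [torch.randn(batch, hidden_dim) for _ in range(n_agents)]   # g(o_j)
    sa_encodings = [torch.randn(batch, hidden_dim) for _ in range(n_agents)]  # f(o_j, a_j)

    query = sel_ext(s_encodings[0])                # agent 0's query: state encoding only
    keys = [k_ext(e) for e in sa_encodings[1:]]    # other agents' keys: state-action encodings
    values = [v_ext(e) for e in sa_encodings[1:]]  # other agents' values: state-action encodings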

About the results

Hello, thank you very much for open-sourcing the code for this paper. This is very good work!

When running this code in the Cooperative Treasure Collection multi-agent environment, my results are as follows:

[training curve screenshot]

These results are quite different from the average reward in the paper, which is about 100, and I have not changed any parameters. Is there anything special about how the average reward is calculated?

How to evaluate and test the training result?

Hi,
Your code is written very well~ Thanks for your great work! But across all the folders I only found code for training the model; I didn't find code for testing or evaluating the trained model in the environments. Would you mind uploading an evaluation script for testing or evaluating?
Thank you~

How to run your code in other scenarios, e.g., cooperative navigation

Hi, thanks for your great work! I ran into a problem when running your code in other scenarios, e.g., simple_spread. The command I used is "python main.py simple_spread dataxx --use_gpu". Is it caused by the gym version? I have tested both gym 0.9.4 and 0.12.5, but the following problem still exists. Could you please give me some advice on the problem? Thanks very much!

[error screenshots]

Seeding fails to produce deterministic results

Hi, thanks for this great code. I've been using it for some experiments and have been having some issues with replicability. One thing I notice is that learning curves are different even when I try and set the same random seed. I still get different results even if I do:

    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # if you are using multi-GPU.
    np.random.seed(seed)  # Numpy module.
    random.seed(seed)  # Python random module.
    torch.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

pytorch/pytorch#7068

Have you noticed this issue and is there a way to resolve it?

Also, a quick unrelated question: for Treasure Collection I notice that by default you use substantial reward shaping. Was this shaping used for the Actor-Attention-Critic paper? Were you able to successfully train without it?

cannot handle some scenarios

Hello, I tried to run your code on simple_reference and simple_world_comm. However, the code reported an error. Can MAAC handle scenarios with both physical actions and communication actions?

issue thanks!

Asking for advice: for inps (list of PyTorch matrices), described as "Inputs to each agents' encoder (batch of obs + ac)", how should this parameter be set? I tried:

    A = np.mat(obs1)
    B = np.mat(action)
    inps = list((A, B))

which leads to:

    states = [s for s, a in inps]
    ValueError: too many values to unpack (expected 2)

Looking forward to your reply!
o_dim = 12
a_dim = 6
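From the quoted line states = [s for s, a in inps], the expected structure appears to be one (observation, action) pair per agent, each a batched tensor. A hedged, self-contained sketch of that format (shapes are illustrative, reusing the o_dim and a_dim above):

    import torch

    o_dim, a_dim, batch_size, n_agents = 12, 6, 32, 2
    inps = [(torch.randn(batch_size, o_dim),    # agent i's batch of observations
             torch.randn(batch_size, a_dim))    # agent i's batch of (one-hot) actions
            for _ in range(n_agents)]
    states = [s for s, a in inps]               # unpacks cleanly: one (obs, acs) tuple per agent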

Critic encoders as shared modules?

Dear Shariq,

In your article "Actor-Attention-Critic for Multi-Agent Reinforcement Learning", figure 1., there is a MLP, which is said to be unique per agent according to the legend (blue background color), that realize a state-action encoding from (o_i, a_i) to e_i.

I believed this corresponds to what is named "critic_encoders" in your code. However, these encoders are listed in the shared_modules here:

self.value_extractors, self.critic_encoders]

Is it normal ? From my understanding, the consequence from being listed in shared_modules is that the gradients are scaled by (1/agents_numbers), so I believe this would have only minor consequences on the behavior of the algorithm.

Best regards

How does the gradient back-propagate from Q to the action $a_i$?

I wonder how the gradient back-propagates from Q to $a_i$.
Tracing from Q:

MAAC/utils/critics.py, lines 149 to 150 (at 105d60e):

    all_q = self.critics[a_i](critic_in)
    int_acs = actions[a_i].max(dim=1, keepdim=True)[1]

Then tracing critic_in:

    critic_in = torch.cat((s_encodings[i], *other_all_values[i]), dim=1)

Since s_encoding doesn't contain input from $a_i$, I then trace other_all_values[i]:

MAAC/utils/critics.py, lines 125 to 141 (at 105d60e):

    for curr_head_keys, curr_head_values, curr_head_selectors in zip(
            all_head_keys, all_head_values, all_head_selectors):
        # iterate over agents
        for i, a_i, selector in zip(range(len(agents)), agents, curr_head_selectors):
            keys = [k for j, k in enumerate(curr_head_keys) if j != a_i]
            values = [v for j, v in enumerate(curr_head_values) if j != a_i]
            # calculate attention across agents
            attend_logits = torch.matmul(selector.view(selector.shape[0], 1, -1),
                                         torch.stack(keys).permute(1, 2, 0))
            # scale dot-products by size of key (from Attention is All You Need)
            scaled_attend_logits = attend_logits / np.sqrt(keys[0].shape[1])
            attend_weights = F.softmax(scaled_attend_logits, dim=2)
            other_values = (torch.stack(values).permute(1, 2, 0) *
                            attend_weights).sum(dim=2)
            other_all_values[i].append(other_values)
            all_attend_logits[i].append(attend_logits)
            all_attend_probs[i].append(attend_weights)

keys and values don't contain agent i's action as input, and selector uses only observations as input:

MAAC/utils/critics.py, lines 118 to 119 (at 105d60e):

    all_head_selectors = [[sel_ext(enc) for i, enc in enumerate(s_encodings) if i in agents]
                          for sel_ext in self.selector_extractors]

So, is there a gradient from Q to action $a_i$?

About the code

Hi,
I'm very sorry to trouble you. I am reading your MAAC paper and I see that the results are compared with COMA, so I want to ask whether you could also release your COMA code.

Thanks!

Does training an advantage soft actor-critic from a replay buffer have large bias?

Dear Author,

I took a quick look at your code for the actor updates. It seems that you use an advantage soft actor-critic, i.e.,

    Advantage: pol_target = q - v
    Loss:      pol_loss = (log_pi * (log_pi / self.reward_scale - pol_target).detach()).mean()

If you use the above updates, I think it is an on-policy soft A2C, so an unbiased actor should only be updated from freshly collected data rather than from the replay buffer. Otherwise, it will be a biased estimate with respect to the current policy. Right?

Best,
Hui

Bias on value extractors?

Dear Shariq,

In your article, no bias is used to calculate the x_i. However, in the code the bias is not set to False for the value extractors, and I believe its default value is True:

    self.value_extractors.append(nn.Sequential(nn.Linear(hidden_dim,

Is there a reason for that? Thank you.

About environment

This unwrapped multiagent environment has an abstract reset() method.
The worker() method in make_env.py may be modified as follows:
[screenshot of the suggested modification failed to upload]
Errors occur without these modifications on my Python 3.5. What about yours?

multi-agent particle environments

When I run your multi-agent particle environments, I get this error:

    Traceback (most recent call last):
      File "/home/cherry/multiagent-particle-envs-master/bin/interactive.py", line 26, in
        env.render()
      File "/home/cherry/anaconda3/envs/shyang/lib/python3.6/site-packages/gym/core.py", line 108, in render
        raise NotImplementedError
    NotImplementedError

Critic function learning

Hi Shariq,
In your implementation and the MAAC paper, you use expected discounted returns to learn the state-action Q function, e.g., Eq. (2) and (7), instead of the maximum of Q(s, a) over actions a. Could you explain this or give a reference?
Best,
Yesiam
