
maddpg-pytorch's Introduction

MADDPG-PyTorch

PyTorch Implementation of MADDPG from Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments (Lowe et al., 2017)

Requirements

The versions are just what I used and not necessarily strict requirements.

How to Run

All training code is contained within main.py. To view the available options, simply run:

python main.py --help
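For example, to train on one of the scenarios below (the positional arguments are the environment id and a name for the run's model directory; the names here are placeholders, not values taken from the repo):

python main.py simple_speaker_listener my_first_run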

Results

Physical Deception

In this task, the two blue agents are rewarded by minimizing the closest of their distances to the green landmark (only one needs to be close to get optimal reward), while maximizing the distance of the red adversary from the green landmark. The red adversary is rewarded by minimizing its distance to the green landmark; however, on any given trial, it does not know which landmark is green, so it must follow the blue agents. As such, the blue agents should learn to deceive the red agent by covering both landmarks.
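A rough sketch of that reward structure, as an illustration of the description above (not the exact multiagent-particle-envs code):

import numpy as np

def blue_reward(blue_positions, adversary_pos, green_landmark):
    # reward the team for the closest blue agent being near the green landmark...
    closest = min(np.linalg.norm(p - green_landmark) for p in blue_positions)
    # ...and for the red adversary being far from it
    return -closest + np.linalg.norm(adversary_pos - green_landmark)

def adversary_reward(adversary_pos, green_landmark):
    # the adversary only wants to be close to the green landmark
    return -np.linalg.norm(adversary_pos - green_landmark)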

Cooperative Communication

This task involves two agents, one that is stationary and one that can move. The stationary agent sees the color of the other agent as its observation, and outputs a one-hot communication vector as its action. The moving agent receives the communication vector, as well as its relative distance to all landmarks on the screen; however, it does not know its own color. The goal of both agents is for the moving agent to reach the landmark that matches its own color. Thus, the agents must learn to communicate such that the moving agent knows where to go on each randomized trial.
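A rough sketch of the observation/action structure just described (illustrative only, not the exact scenario code):

import numpy as np

n_landmarks = 3
goal = np.random.randint(n_landmarks)

# the stationary speaker observes only the moving agent's goal color
speaker_obs = np.eye(n_landmarks)[goal]
# and emits a one-hot communication vector (in practice, a learned mapping)
speaker_action = np.eye(n_landmarks)[int(speaker_obs.argmax())]

# the moving listener observes its relative position to each landmark plus the
# message, but never its own color
rel_landmark_pos = np.random.randn(n_landmarks, 2)
listener_obs = np.concatenate([rel_landmark_pos.ravel(), speaker_action])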

Predator-Prey

This task involves a single prey agent (in green) and a team of three predators (in red). The prey agent is 30% faster than the predators, so the predators must learn how to team up in order to catch the prey.

In the trials below, the prey agent uses DDPG as its learning algorithm.

Not Implemented

There are a few items from the paper that have not been implemented in this repo:

  • Ensemble Training
  • Inferring other agents' policies
  • Mixed continuous/discrete action spaces

Acknowledgements

The OpenAI baselines TensorFlow implementation and Ilya Kostrikov's PyTorch implementation of DDPG were used as references. After the majority of this codebase was complete, OpenAI released their code for MADDPG, and I made some tweaks to this repo to reflect some of the details in their implementation (e.g. gradient norm clipping and policy regularization).
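For reference, a minimal sketch of those two tweaks (illustrative coefficients, not necessarily the exact values used in this repo):

import torch
import torch.nn as nn

policy = nn.Linear(4, 2)
opt = torch.optim.Adam(policy.parameters())

logits = policy(torch.randn(8, 4))
# stand-in policy objective plus the policy (action) regularization term
loss = -logits.mean() + (logits ** 2).mean() * 1e-3

opt.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(policy.parameters(), 0.5)   # gradient norm clipping
opt.step()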

maddpg-pytorch's People

Contributors

shariqiqbal2810

maddpg-pytorch's Issues

Why do one update per rollout_thread?

Thanks for the great code!

I've got more of a question than an issue: in main.py you do one update per rollout thread

for u_i in range(config.n_rollout_threads):
    for a_i in range(maddpg.nagents):
        sample = replay_buffer.sample(config.batch_size, to_gpu=USE_CUDA)
        maddpg.update(sample, a_i, logger=logger)
    maddpg.update_all_targets()

And then in the update method for MADDPG agents, you average the gradients:

if parallel:
    average_gradients(curr_agent.critic)

Why do that rather than just a single update?

You use synchronous rollout threads that wait for each other at each timestep and put all of their collected transitions into a single buffer. You then train on this single buffer using the master thread only. Since your condition for doing an update is to do one every config.steps_per_update steps, it seems to me it would be less wasteful to do just a single update whenever this condition is true; the fact that you have multiple rollout threads would then simply help collect the required transitions faster.

Am I missing something?

How to use the Wall object to create environments

Hi @shariqiqbal2810, great repo!

I am not sure how to use the Wall object that you added in order to create a scenario similar to simple_tag, but bounded with a wall that cannot be passed, so that the agents are not able to "escape" the rendering window. I have read that it is very common for agents to run to infinity outside of the rendering box, and I want to speed up training.

I would really appreciate it if you could give me some feedback on that!

Speaker not outputting discrete values

In the 'Cooperative Communication' scenario, the speaker agent's action is also made continuous, which shouldn't be the case; only the listener agent has continuous actions. The simulation still works, but I think forcing the speaker to output discrete values would be more appropriate.
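One way to do that, as a minimal straight-through sketch of my own (not necessarily how the repo's onehot_from_logits/gumbel_softmax utilities handle it):

import torch
import torch.nn.functional as F

def discretize(logits):
    # pick the argmax as a one-hot message, but keep gradients flowing
    # through the softmax (straight-through estimator)
    probs = F.softmax(logits, dim=-1)
    one_hot = F.one_hot(probs.argmax(dim=-1), logits.shape[-1]).float()
    return (one_hot - probs).detach() + probs

msg = discretize(torch.randn(1, 3))   # e.g. tensor([[0., 1., 0.]])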

Confusion about agent.state.c

Dear Author,

In an environment such as speaker and listener, the listener can obtain the speaker's state.c as its observation; however, I don't find where state.c is updated.

Where are you exploring?

Hi!

Great repo, I am glad that you implemented the code in PyTorch!

If you use exploration=True, are you really exploring? Epsilon is always 0.0, so you would never use a randomly selected action, or am I incorrect?

Problems when using Discrete Action

Hi, thanks for the PyTorch version of MADDPG. I notice that it is important to use --discrete_action, but when I add this option, something goes wrong.
Here is my command:
python main.py simple_tag test --discrete_action
And here is the error:

Traceback (most recent call last):
File "main.py", line 155, in
run(config)
File "main.py", line 100, in run
maddpg.update(sample, a_i, logger=logger)
File "/home/maddpg-pytorch/algorithms/maddpg.py", line 143, in update
curr_pol_vf_in = gumbel_softmax(curr_pol_out, hard=True)
File "/home/maddpg-pytorch/utils/misc.py", line 88, in gumbel_softmax
y = gumbel_softmax_sample(logits, temperature)
File "/home/maddpg-pytorch/utils/misc.py", line 73, in gumbel_softmax_sample
y = logits + sample_gumbel(logits.shape, tens_type=type(logits.data))
RuntimeError: expected device cuda:0 but got device cpu

I have set USE_CUDA = True.

Thanks for your help!
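A possible workaround (my assumption, not a confirmed fix from the maintainer) is to create the Gumbel noise on the same device as the logits, e.g.:

import torch

def gumbel_noise_like(logits, eps=1e-20):
    # sample Gumbel(0, 1) noise directly on the logits' device
    U = torch.rand(logits.shape, device=logits.device)
    return -torch.log(-torch.log(U + eps) + eps)

# then, inside gumbel_softmax_sample:
# y = logits + gumbel_noise_like(logits)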

Usage of scenarios in MADDPG PyTorch

Hello,

I want to use your MADDPG algorithm, but I don't know how to use it.

Can you please tell me the necessary parameters to use the "simple_speaker_listener" scenario, for example?

Thank you in advance.

Best Regards,

Abdelali

No module named 'baselines'

When I tried to run main.py, it raised an error:

from baselines.common.vec_env import VecEnv, CloudpickleWrapper
ModuleNotFoundError: No module named 'baselines'

Would someone tell me how to solve it?

Avg reward different from paper

Hi!

Great repo, also neat evaluation (evaluation.py and tensorboardX in main).

The only thing that I do not understand is why Fig. 4 in the paper shows an average episodic reward of about -20, while in TensorBoard it is on the order of -2.

Please see Fig. 4 from the paper and my TensorBoard output (screenshots attached to the issue).

Question about some arguments.

Hi Shariq,

I'm a little confused about the 'n_rollout_threads' and 'n_training_threads'.

In your code, n_rollout_threads is set to 1. If I understand correctly, this means using DummyVecEnv without collecting experiences in parallel. Any insight into why you don't use SubprocVecEnv, as in A2C?

And what about n_training_threads? Is this setting only for acceleration?

Thanks a lot!

Env problem

The cooperative communication env did not work. It returns the following:

Traceback (most recent call last):
File "main.py", line 155, in
run(config)
File "main.py", line 57, in run
hidden_dim=config.hidden_dim)
File "/home/nuc/causal_ws/maddpg-pytorch/algorithms/maddpg.py", line 256, in init_from_env
num_in_critic += get_shape(oacsp)
File "/home/nuc/causal_ws/maddpg-pytorch/algorithms/maddpg.py", line 249, in
get_shape = lambda x: x.n
AttributeError: 'Box' object has no attribute 'n'

Config for demo tasks

Hi shariqiqbal2810,

First of all thanks a lot for sharing this great implementation. It is super clear and very enlightening! :)

Would it be possible for you to provide some suggested config (number of training episodes, learning rates, etc.) to get the results that you showcase in the readme?

Or do you use exactly the same hyperparameters as in the original paper?

I have run some scenarios with the default configs and get much worse results than you.

Thanks a lot!

About optim

Here is a note from the PyTorch documentation (https://pytorch.org/docs/1.2.0/optim.html#constructing-it):
If you need to move a model to GPU via .cuda(), please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects with those before the call.

In general, you should make sure that optimized parameters live in consistent locations when optimizers are constructed and used.

I noticed that .cuda() or .cpu() can be called after the optimizers are constructed, yet the algorithm still works. I can't figure out why.
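For reference, the ordering the PyTorch docs recommend looks like this (a minimal sketch, not the repo's code):

import torch
import torch.nn as nn

model = nn.Linear(8, 2)
if torch.cuda.is_available():
    model.cuda()                               # move parameters to the GPU first
opt = torch.optim.Adam(model.parameters())     # then construct the optimizer

One plausible reason the code still works when .cuda() is called afterwards (my guess, not a confirmed explanation) is that nn.Module.cuda() moves parameter data in place, so the Parameter objects the optimizer references stay the same, and Adam only allocates its per-parameter state lazily on the first step().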

What do the black blocks stand for?

Dear author,

Your project is helpful to the research community. I just want to ask: what are the black blocks in predator-prey? Are they forbidden areas for both the predators and the prey?

No detach in critic network, is it a bug?

Dear Authors,

Thanks a lot for your released code. It's very helpful.

I notice that when updating the parameters of the current agent's critic, you don't detach the other agents' actions:

all_pol_acs.append(onehot_from_logits(pi(ob)))
all_pol_acs.append(pi(ob))

Should it be

all_pol_acs.append(onehot_from_logits(pi(ob).detach()))
all_pol_acs.append(pi(ob).detach())

As I understand it, you update each agent's critic separately. If so, the other agents' policies should be fixed, right?

About policy loss

Could you tell me how to understand this "pol_loss += (curr_pol_out**2).mean() * 1e-3" in policy loss?

Hi, I got a problem when I try to execute main.py

The log is as follows:

$ python main.py simple_push push_model
E:\anaconda\lib\site-packages\gym\logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Episodes 1-2 of 25000
Traceback (most recent call last):
  File "main.py", line 166, in <module>
    run(config)
  File "main.py", line 72, in run
    obs = env.reset()
  File "D:\GithubProject\py\maddpg-pytorch-master\utils\env_wrappers.py", line 125, in reset
    results = [env.reset() for env in self.envs]
  File "D:\GithubProject\py\maddpg-pytorch-master\utils\env_wrappers.py", line 125, in <listcomp>
    results = [env.reset() for env in self.envs]
  File "E:\anaconda\lib\site-packages\gym\core.py", line 66, in reset
    raise NotImplementedError
NotImplementedError

What's wrong with the NotImplementedError? Thank you very much for your answer!

Speaker's messages sufficient to communicate goal?

Hi!

The observations of the listener are the messages of the speaker.

This should be sufficient to communicate the goal location (one out of three landmarks). However, the speaker only observes the goal color, which is nothing more than the color of the target landmark.

How can this be sufficient to guide the listener?

The speaker does not know the position of the listener relative to the target landmark, so it cannot say something like "up, down, up, up". Alternatively, the listener does not see the landmark colors, so it cannot understand messages like "red".

I hope this can be clarified!

Replicating results

Hi,
Which settings do you use to replicate the paper's results?

On the simple_tag scenario, the best configuration so far is:

  • 128 units, 2-layer MLP for the policy
  • 64 units, 1-layer MLP for the critic
  • 60k episodes

When I run the model after training, I get some good cases (like the ones you showed here), but also lots of cases where it doesn't work so well. Is that normal, or am I doing something wrong?

I am still not able to replicate the results for the simple_speaker_listener scenario.

How to set env_id and model_name in main.py

Hi,
I'm running into the following error with these arguments defined in main.py:

parser.add_argument("env_id", type=str, default="simple_tag",
                    help="name of the scenario script")
parser.add_argument("model_name", default="./model/",
                    help="Name of directory to store " +
                         "model/training contents")

main.py: error: the following arguments are required: env_id, model_name

And when I simply run python main.py --env_id simple_tag --model_name ./model/, this error pops up: main.py: error: unrecognized arguments: --env_id --model_name

I am not clear on how to set these arguments; could you leave me an example of how to run main.py with them?

Thanks.

Trouble setting up the environment

Hi, can you add a requirements file for your environment or a setup script?
I am having trouble setting it up to run the code.

Thanks!

Problem in building env

Hi, thank you for your public reproduction project. I just followed your guidance, including downloading the multi-agent environment, but there is an error when running main.py.

File "/home/l/anaconda3/lib/python3.7/site-packages/gym/core.py", line 64, in reset
    raise NotImplementedError
NotImplementedError

Looking forward to your reply.

Discrete action

Hi shariqiqbal2810, thanks a lot for sharing this MADDPG implementation. Following your default parameter settings, I ran the gym scenarios (simple and simple_spread) many times, but all these results showed that the agents did not learn a good policy (for example, the rewards in simple_spread basically converge to -3).

I found a difference between your implementation and OpenAI's when setting discrete actions:
OpenAI's implementation: e.g. tensor([[ 0.0125, -0.0128, 0.0105, 0.1107, -0.0798]])
This implementation: one-hot, e.g. tensor([[0., 1., 0., 0., 0.]])
It does not seem to be the root cause, though: when I changed this discrete action handling, the experimental results were still bad.

Could you give me some advice for training a good agent?

Does shared weights among actors matter?

Hi @shariqiqbal2810, thanks for the nice implementation.
I spotted a difference between your implementation and the official one. In the official code (@openai/maddpg), the actors' weights are shared, while your implementation does not share them.

Have you noticed any resulting differences, such as unstable learning, slow convergence, or more diverse behavior?
Thanks in advance.

Incorrect actor update

Hi Shariq,
In my opinion, your actor update isn't right. In the paper, the actor should only be used for the agent being updated (actor_id), and the other agents' actions should come from the buffer. Am I right?

About discrete case

Hi, could you tell me how the action space is explored in the discrete case? Is it just gumbel_softmax_sample? I find that self.exploration is not used.

Implementation request: simple_reference & move-and-communicate ability?

Hi Shariq,

You've (unintentionally?) shut off the shared_reward, which is why simple_reference, in which agents both move and speak, will not work.

self.shared_reward = world.collaborative if hasattr(world, 'collaborative') else False

Moreover, it requires actions to be an instance of the MultiDiscrete class from the gym.spaces library. Maybe you need to use this as well (environment.py):

if all([isinstance(act_space, spaces.Discrete) for act_space in total_action_space]):
    act_space = spaces.MultiDiscrete([act_space.n for act_space in total_action_space])
Shall I make a pull-request with the changes?

Update order of actor and critic

It seems that you update the critic before the actor.

As far as I know, the actor_loss is calculated through the critic network, so the backward pass of actor_loss also writes gradients into the critic's parameters.

Should we update the actor first, and then update the critic using both actor_loss and critic_loss?
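A minimal sketch of why the order may not matter, as long as each optimizer zeroes its gradients before its own backward/step (toy networks, not the repo's code):

import torch
import torch.nn as nn

actor = nn.Linear(4, 2)
critic = nn.Linear(4 + 2, 1)
actor_opt = torch.optim.Adam(actor.parameters())
critic_opt = torch.optim.Adam(critic.parameters())

obs, act, target = torch.randn(8, 4), torch.randn(8, 2), torch.randn(8, 1)

# critic update
critic_opt.zero_grad()
critic_loss = ((critic(torch.cat([obs, act], dim=1)) - target) ** 2).mean()
critic_loss.backward()
critic_opt.step()

# actor update: backward() also writes gradients into the critic's parameters,
# but critic_opt.step() is not called here, and the next critic update starts
# with critic_opt.zero_grad(), which discards those stale gradients
actor_opt.zero_grad()
actor_loss = -critic(torch.cat([obs, actor(obs)], dim=1)).mean()
actor_loss.backward()
actor_opt.step()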

Hello! When I run the following program according to your instructions, I get an error, and I want to ask you about it

Hello! When I run the following command according to your instructions, the following error occurs. What is the reason?
python main.py simple_world_comm predator-prey
Traceback (most recent call last):
File "main.py", line 155, in
run(config)
File "main.py", line 52, in run
config.discrete_action)
File "main.py", line 26, in make_parallel_env
return DummyVecEnv([get_env_fn(0)])
File "/home/dcy/MADDPG/maddpg-pytorch-master/utils/env_wrappers.py", line 99, in init
self.envs = [fn() for fn in env_fns]
File "/home/dcy/MADDPG/maddpg-pytorch-master/utils/env_wrappers.py", line 99, in
self.envs = [fn() for fn in env_fns]
File "main.py", line 20, in init_env
env = make_env(env_id, discrete_action=discrete_action)
File "/home/dcy/MADDPG/maddpg-pytorch-master/utils/make_env.py", line 47, in make_env
discrete_action=discrete_action)
TypeError: init() got an unexpected keyword argument 'discrete_action'

Benchmarking and obtaining Table 2

Hi Shariq,

Great repo. The scenarios have benchmark data, but neither your code nor the TensorFlow repo has a script that analyzes this data.

Also, to get the numbers for Table 2 (Appendix), I only count collisions once per episode (so if agents collide multiple times, I discard the extras). Regarding avg_dist, I only check the distance at the end of an episode.

Do you have any knowledge on how the numbers of Table 2 can be obtained?
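A sketch of the counting scheme described above (my assumption about the benchmark data layout, not a script from either repo):

# benchmark_data[episode][step] == (collided: bool, dist_to_landmark: float)
def summarize(benchmark_data):
    # count at most one collision per episode
    collisions = sum(any(c for c, _ in ep) for ep in benchmark_data)
    # average the distance measured at the final step of each episode
    avg_dist = sum(ep[-1][1] for ep in benchmark_data) / len(benchmark_data)
    return collisions, avg_dist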
