
maddpg-pytorch's Introduction

MADDPG-PyTorch

PyTorch Implementation of MADDPG from Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments (Lowe et al., 2017)

Requirements

The versions are just what I used and not necessarily strict requirements.

How to Run

All training code is contained within main.py. To view the available options, simply run:

python main.py --help
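For example, to train on one of the scenarios below (the positional arguments are the environment id and a name for the run's model directory; the names here are placeholders, not values taken from the repo):

python main.py simple_speaker_listener my_first_run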

Results

Physical Deception

In this task, the two blue agents are rewarded by minimizing the closest of their distances to the green landmark (only one needs to be close to get optimal reward), while maximizing the distance of the red adversary from the green landmark. The red adversary is rewarded by minimizing its distance to the green landmark; however, on any given trial, it does not know which landmark is green, so it must follow the blue agents. As such, the blue agents should learn to deceive the red agent by covering both landmarks.
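A rough sketch of that reward structure, as an illustration of the description above (not the exact multiagent-particle-envs code):

import numpy as np

def blue_reward(blue_positions, adversary_pos, green_landmark):
    # reward the team for the closest blue agent being near the green landmark...
    closest = min(np.linalg.norm(p - green_landmark) for p in blue_positions)
    # ...and for the red adversary being far from it
    return -closest + np.linalg.norm(adversary_pos - green_landmark)

def adversary_reward(adversary_pos, green_landmark):
    # the adversary only wants to be close to the green landmark
    return -np.linalg.norm(adversary_pos - green_landmark)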

Cooperative Communication

This task involves two agents, one that is stationary and one that can move. The stationary agent sees the color of the other agent as its observation, and outputs a one-hot communication vector as its action. The moving agent receives the communication vector, as well as its relative distance to all landmarks on the screen; however, it does not know its own color. The goal of both agents is for the moving agent to reach the landmark that matches its own color. Thus, the agents must learn to communicate such that the moving agent knows where to go on each randomized trial.
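A rough sketch of the observation/action structure just described (illustrative only, not the exact scenario code):

import numpy as np

n_landmarks = 3
goal = np.random.randint(n_landmarks)

# the stationary speaker observes only the moving agent's goal color
speaker_obs = np.eye(n_landmarks)[goal]
# and emits a one-hot communication vector (in practice, a learned mapping)
speaker_action = np.eye(n_landmarks)[int(speaker_obs.argmax())]

# the moving listener observes its relative position to each landmark plus the
# message, but never its own color
rel_landmark_pos = np.random.randn(n_landmarks, 2)
listener_obs = np.concatenate([rel_landmark_pos.ravel(), speaker_action])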

Predator-Prey

This task involves a single prey agent (in green) and a team of three predators (in red). The prey agent is 30% faster than the predators, so the predators must learn how to team up in order to catch the prey.

In the trials below, the prey agent uses DDPG as its learning algorithm.

Not Implemented

There are a few items from the paper that have not been implemented in this repo:

  • Ensemble Training
  • Inferring other agents' policies
  • Mixed continuous/discrete action spaces

Acknowledgements

The OpenAI baselines TensorFlow implementation and Ilya Kostrikov's PyTorch implementation of DDPG were used as references. After the majority of this codebase was complete, OpenAI released their code for MADDPG, and I made some tweaks to this repo to reflect some of the details in their implementation (e.g. gradient norm clipping and policy regularization).
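For reference, a minimal sketch of those two tweaks (illustrative coefficients, not necessarily the exact values used in this repo):

import torch
import torch.nn as nn

policy = nn.Linear(4, 2)
opt = torch.optim.Adam(policy.parameters())

logits = policy(torch.randn(8, 4))
# stand-in policy objective plus the policy (action) regularization term
loss = -logits.mean() + (logits ** 2).mean() * 1e-3

opt.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(policy.parameters(), 0.5)   # gradient norm clipping
opt.step()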

maddpg-pytorch's People

Contributors

shariqiqbal2810

maddpg-pytorch's Issues

Why do one update per rollout_thread?

Thanks for the great code!

I've got more of a question than an issue: in main.py you do one update per rollout thread

for u_i in range(config.n_rollout_threads):
    for a_i in range(maddpg.nagents):
        sample = replay_buffer.sample(config.batch_size, to_gpu=USE_CUDA)
        maddpg.update(sample, a_i, logger=logger)
    maddpg.update_all_targets()

And then in the update method for MADDPG agents, you average the gradients:

if parallel:
    average_gradients(curr_agent.critic)

Why do that rather than just a single update?

You use synchronous rollout threads that wait for each other at each timestep and put all of their collected transitions into a single buffer. You then train on this single buffer using the master thread only. Since your condition for doing an update is to do one every config.steps_per_update steps, it seems to me it would be less wasteful to do just a single update whenever this condition is true; the fact that you have multiple rollout threads would then simply help collect the required transitions faster.

Am I missing something?

How to use the Wall object to create environments

Hi @shariqiqbal2810, great repo!

I am not sure how to use the Wall object that you added in order to create a scenario similar to simple_tag, but bounded with a wall that cannot be passed, so that the agents are not able to "escape" the rendering window. I have read that it is very common for agents to run to infinity outside of the rendering box, and I want to speed up training.

I would really appreciate it if you could give me some feedback on that!

Speaker not outputting discrete values

In the 'Cooperative Communication' scenario, the speaker agent's action is also made continuous, which shouldn't be the case; only the listener agent has continuous actions. The simulation still works, but I think forcing the speaker to output discrete values would be more appropriate.
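One way to do that, as a minimal straight-through sketch of my own (not necessarily how the repo's onehot_from_logits/gumbel_softmax utilities handle it):

import torch
import torch.nn.functional as F

def discretize(logits):
    # pick the argmax as a one-hot message, but keep gradients flowing
    # through the softmax (straight-through estimator)
    probs = F.softmax(logits, dim=-1)
    one_hot = F.one_hot(probs.argmax(dim=-1), logits.shape[-1]).float()
    return (one_hot - probs).detach() + probs

msg = discretize(torch.randn(1, 3))   # e.g. tensor([[0., 1., 0.]])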

Confusion about agent.state.c

Dear Author,

In an environment such as speaker and listener, the listener can obtain the speaker's state.c as its observation; however, I don't find where state.c is updated.

Where are you exploring?

Hi!

Great repo, I am glad that you implemented the code in PyTorch!

If you use exploration=True, are you really exploring? Epsilon is always 0.0, so you would never use a randomly selected action, or am I incorrect?

Problems when using Discrete Action

Hi, thanks for the PyTorch version of MADDPG. I notice that it is important to use --discrete_action, but when I add this option, something goes wrong.
Here is my command:
python main.py simple_tag test --discrete_action
And here is the error:

Traceback (most recent call last):
File "main.py", line 155, in
run(config)
File "main.py", line 100, in run
maddpg.update(sample, a_i, logger=logger)
File "/home/maddpg-pytorch/algorithms/maddpg.py", line 143, in update
curr_pol_vf_in = gumbel_softmax(curr_pol_out, hard=True)
File "/home/maddpg-pytorch/utils/misc.py", line 88, in gumbel_softmax
y = gumbel_softmax_sample(logits, temperature)
File "/home/maddpg-pytorch/utils/misc.py", line 73, in gumbel_softmax_sample
y = logits + sample_gumbel(logits.shape, tens_type=type(logits.data))
RuntimeError: expected device cuda:0 but got device cpu

I have set USE_CUDA = True.

Thanks for your help!
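A possible workaround (my assumption, not a confirmed fix from the maintainer) is to create the Gumbel noise on the same device as the logits, e.g.:

import torch

def gumbel_noise_like(logits, eps=1e-20):
    # sample Gumbel(0, 1) noise directly on the logits' device
    U = torch.rand(logits.shape, device=logits.device)
    return -torch.log(-torch.log(U + eps) + eps)

# then, inside gumbel_softmax_sample:
# y = logits + gumbel_noise_like(logits)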

Usage of scenarios in MADDPG PyTorch

Hello,

I want to use your MADDPG algorithm, but I don't know how to use it.

Can you please tell me the necessary parameters to use the "simple_speaker_listener" scenario, for example?

Thank you in advance.

Best Regards,

Abdelali

No module named 'baselines'

When I tried to run main.py, it raised an error:

from baselines.common.vec_env import VecEnv, CloudpickleWrapper
ModuleNotFoundError: No module named 'baselines'

Would someone tell me how to solve it?

Avg reward different from paper

Hi!

Great repo, also neat evaluation (evaluation.py and tensorboardX in main).

The only thing that I do not understand is why Fig. 4 in the paper shows an average episodic reward of about -20, while in TensorBoard it is on the order of -2.

Please see Fig. 4 from the paper and my TensorBoard output (screenshots attached to the issue).

Question about some arguments.

Hi Shariq,

I'm a little confused about the 'n_rollout_threads' and 'n_training_threads'.

In your code, n_rollout_threads is set to 1. If I understand correctly, this means using DummyVecEnv without collecting experiences in parallel. Any insight into why you don't use SubprocVecEnv, as in A2C?

And what about n_training_threads? Is this setting only for acceleration?

Thanks a lot!

Env problem

The cooperative communication env did not work. It returns the following:

Traceback (most recent call last):
File "main.py", line 155, in
run(config)
File "main.py", line 57, in run
hidden_dim=config.hidden_dim)
File "/home/nuc/causal_ws/maddpg-pytorch/algorithms/maddpg.py", line 256, in init_from_env
num_in_critic += get_shape(oacsp)
File "/home/nuc/causal_ws/maddpg-pytorch/algorithms/maddpg.py", line 249, in
get_shape = lambda x: x.n
AttributeError: 'Box' object has no attribute 'n'

Config for demo tasks

Hi shariqiqbal2810,

First of all thanks a lot for sharing this great implementation. It is super clear and very enlightening! :)

Would it be possible for you to provide some suggested config (number of training episodes, learning rates, etc.) to get the results that you showcase in the readme?

Or do you use exactly the same hyperparameters as in the original paper?

I have run some scenarios with the default configs and get much worse results than you.

Thanks a lot!

About optim

Here is a note from the PyTorch documentation (https://pytorch.org/docs/1.2.0/optim.html#constructing-it):
If you need to move a model to GPU via .cuda(), please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects with those before the call.

In general, you should make sure that optimized parameters live in consistent locations when optimizers are constructed and used.

I noticed that .cuda() or .cpu() can be called after the optimizers are constructed, yet the algorithm still works. I can't figure out why.
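For reference, the ordering the PyTorch docs recommend looks like this (a minimal sketch, not the repo's code):

import torch
import torch.nn as nn

model = nn.Linear(8, 2)
if torch.cuda.is_available():
    model.cuda()                               # move parameters to the GPU first
opt = torch.optim.Adam(model.parameters())     # then construct the optimizer

One plausible reason the code still works when .cuda() is called afterwards (my guess, not a confirmed explanation) is that nn.Module.cuda() moves parameter data in place, so the Parameter objects the optimizer references stay the same, and Adam only allocates its per-parameter state lazily on the first step().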

What do the black blocks stand for?

Dear author,

Your project is helpful to the research community. I just want to ask: what are the black blocks in predator-prey? Are they forbidden areas for both the predators and the prey?

No detach in critic network, is it a bug?

Dear Authors,

Thanks a lot for your released code. It's very helpful.

I notice that when updating the parameters of the current agent's critic, you don't detach the other agents' actions:

all_pol_acs.append(onehot_from_logits(pi(ob)))
all_pol_acs.append(pi(ob))

Should it be

all_pol_acs.append(onehot_from_logits(pi(ob).detach()))
all_pol_acs.append(pi(ob).detach())

As I understand it, you update each agent's critic separately. If so, the other agents' policies should be fixed, right?

About policy loss

Could you tell me how to understand this "pol_loss += (curr_pol_out**2).mean() * 1e-3" in policy loss?

Hi, I got a problem when I try to execute main.py

The log is as follows:

$ python main.py simple_push push_model
E:\anaconda\lib\site-packages\gym\logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Episodes 1-2 of 25000
Traceback (most recent call last):
  File "main.py", line 166, in <module>
    run(config)
  File "main.py", line 72, in run
    obs = env.reset()
  File "D:\GithubProject\py\maddpg-pytorch-master\utils\env_wrappers.py", line 125, in reset
    results = [env.reset() for env in self.envs]
  File "D:\GithubProject\py\maddpg-pytorch-master\utils\env_wrappers.py", line 125, in <listcomp>
    results = [env.reset() for env in self.envs]
  File "E:\anaconda\lib\site-packages\gym\core.py", line 66, in reset
    raise NotImplementedError
NotImplementedError

What's wrong with the NotImplementedError? Thank you very much for your answer!

Speaker's messages sufficient to communicate goal?

Hi!

The observations of the listener are the messages of the speaker.

This should be sufficient to communicate the goal location (one out of three landmarks). However, the speaker only observes the goal color, which is nothing more than the color of the target landmark.

How can this be sufficient to guide the listener?

The speaker does not know the position of the listener relative to the target landmark, so it cannot say something like "up, down, up, up". Alternatively, the listener does not see the landmark colors, so it cannot understand messages like "red".

I hope this can be clarified!

Replicating results

Hi,
Which settings do you use to replicate the paper's results?

On the simple_tag scenario, the best configuration so far is:

  • 128 units, 2-layer MLP for the policy
  • 64 units, 1-layer MLP for the critic
  • 60k episodes

When I run the model after training, I get some good cases (like the ones you showed here), but also lots of cases where it doesn't work so well. Is that normal, or am I doing something wrong?

I am still not able to replicate the results for the simple_speaker_listener scenario.

How to set env_id and model_name in main.py

Hi,
I'm running into the following error with these arguments defined in main.py:

parser.add_argument("env_id", type=str, default="simple_tag",
                    help="name of the scenario script")
parser.add_argument("model_name", default="./model/",
                    help="Name of directory to store " +
                         "model/training contents")

main.py: error: the following arguments are required: env_id, model_name

And when I simply run python main.py --env_id simple_tag --model_name ./model/, this error pops up: main.py: error: unrecognized arguments: --env_id --model_name

I am not clear on how to set these arguments; could you leave me an example of how to run main.py with them?

Thanks.

Trouble setting up the environment

Hi, can you add a requirements file for your environment or a setup script?
I am having trouble setting it up to run the code.

Thanks!

Problem in building env

Hi, thank you for your public reproduction project. I just followed your guidance, including downloading the multi-agent environment, but there is an error when running main.py.

File "/home/l/anaconda3/lib/python3.7/site-packages/gym/core.py", line 64, in reset
    raise NotImplementedError
NotImplementedError

Looking forward to your reply.

Discrete action

Hi shariqiqbal2810, thanks a lot for sharing this MADDPG implementation. Following your default parameter settings, I ran the gym scenarios (simple and simple_spread) many times, but all these results showed that the agents did not learn a good policy (for example, the rewards in simple_spread basically converge to -3).

I found a difference between your implementation and OpenAI's when setting discrete actions:
OpenAI's implementation: e.g. tensor([[ 0.0125, -0.0128, 0.0105, 0.1107, -0.0798]])
This implementation: one-hot, e.g. tensor([[0., 1., 0., 0., 0.]])
It does not seem to be the root cause, though: when I changed this discrete action handling, the experimental results were still bad.

Could you give me some advice for training a good agent?

Does shared weights among actors matter?

Hi @shariqiqbal2810, thanks for the nice implementation.
I spotted a difference between your implementation and the official one. In the official code (@openai/maddpg), the actors' weights are shared, while your implementation does not share them.

Have you noticed any resulting differences, such as unstable learning, slow convergence, or more diverse behavior?
Thanks in advance.

Incorrect actor update

Hi Shariq,
In my opinion, your actor update isn't right. In the paper, the actor should only be used for the agent being updated (actor_id), and the other agents' actions should come from the buffer. Am I right?

About discrete case

Hi, could you tell me how the action space is explored in the discrete case? Is it just gumbel_softmax_sample? I find that self.exploration is not used.

Implementation request: simple_reference & move-and-communicate ability?

Hi Shariq,

You've (unintentionally?) shut off the shared_reward, which is why simple_reference, in which agents both move and speak, will not work.

self.shared_reward = world.collaborative if hasattr(world, 'collaborative') else False

Moreover, it requires actions to be an instance of the MultiDiscrete class from the gym.spaces library. Maybe you need to use this as well (environment.py):

if all([isinstance(act_space, spaces.Discrete) for act_space in total_action_space]):
    act_space = spaces.MultiDiscrete([act_space.n for act_space in total_action_space])
Shall I make a pull-request with the changes?

Update order of actor and critic

It seems that you update the critic before the actor.

As far as I know, the actor_loss is calculated through the critic network, so the backward pass of actor_loss also writes gradients into the critic's parameters.

Should we update the actor first, and then update the critic using both actor_loss and critic_loss?
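A minimal sketch of why the order may not matter, as long as each optimizer zeroes its gradients before its own backward/step (toy networks, not the repo's code):

import torch
import torch.nn as nn

actor = nn.Linear(4, 2)
critic = nn.Linear(4 + 2, 1)
actor_opt = torch.optim.Adam(actor.parameters())
critic_opt = torch.optim.Adam(critic.parameters())

obs, act, target = torch.randn(8, 4), torch.randn(8, 2), torch.randn(8, 1)

# critic update
critic_opt.zero_grad()
critic_loss = ((critic(torch.cat([obs, act], dim=1)) - target) ** 2).mean()
critic_loss.backward()
critic_opt.step()

# actor update: backward() also writes gradients into the critic's parameters,
# but critic_opt.step() is not called here, and the next critic update starts
# with critic_opt.zero_grad(), which discards those stale gradients
actor_opt.zero_grad()
actor_loss = -critic(torch.cat([obs, actor(obs)], dim=1)).mean()
actor_loss.backward()
actor_opt.step()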

Hello! When I run the following program according to your instructions, I get an error, and I want to ask you about it

Hello! When I run the following command according to your instructions, the following error occurs. What is the reason?
python main.py simple_world_comm predator-prey
Traceback (most recent call last):
File "main.py", line 155, in
run(config)
File "main.py", line 52, in run
config.discrete_action)
File "main.py", line 26, in make_parallel_env
return DummyVecEnv([get_env_fn(0)])
File "/home/dcy/MADDPG/maddpg-pytorch-master/utils/env_wrappers.py", line 99, in init
self.envs = [fn() for fn in env_fns]
File "/home/dcy/MADDPG/maddpg-pytorch-master/utils/env_wrappers.py", line 99, in
self.envs = [fn() for fn in env_fns]
File "main.py", line 20, in init_env
env = make_env(env_id, discrete_action=discrete_action)
File "/home/dcy/MADDPG/maddpg-pytorch-master/utils/make_env.py", line 47, in make_env
discrete_action=discrete_action)
TypeError: init() got an unexpected keyword argument 'discrete_action'

Benchmarking and obtaining Table 2

Hi Shariq,

Great repo. The scenarios have benchmark data, but neither your code nor the TensorFlow repo has a script that analyzes this data.

Also, to get the numbers for Table 2 (Appendix), I only count collisions once per episode (so if agents collide multiple times, I discard the extras). Regarding avg_dist, I only check the distance at the end of an episode.

Do you have any knowledge on how the numbers of Table 2 can be obtained?
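A sketch of the counting scheme described above (my assumption about the benchmark data layout, not a script from either repo):

# benchmark_data[episode][step] == (collided: bool, dist_to_landmark: float)
def summarize(benchmark_data):
    # count at most one collision per episode
    collisions = sum(any(c for c, _ in ep) for ep in benchmark_data)
    # average the distance measured at the final step of each episode
    avg_dist = sum(ep[-1][1] for ep in benchmark_data) / len(benchmark_data)
    return collisions, avg_dist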
