
chenglongchen / pytorch-drl


PyTorch implementations of various Deep Reinforcement Learning (DRL) algorithms for both single agent and multi-agent.

License: MIT License

Python 100.00%
Topics: pytorch, deep-reinforcement-learning, multi-agent, deep-q-network, actor-critic, advantage-actor-critic, a2c, proximal-policy-optimization, ppo, deep-deterministic-policy-gradient, ddpg, acktr, rl, drl, madrl, dqn, reinforcement-learning

pytorch-drl's Introduction

pytorch-madrl

This project includes PyTorch implementations of various Deep Reinforcement Learning algorithms for both single-agent and multi-agent settings.

  • A2C
  • ACKTR
  • DQN
  • DDPG
  • PPO

It is written in a modular way to allow code sharing between different algorithms. Specifically, each algorithm is represented as a learning agent with a unified interface comprising the following components (a sketch of this interface follows the list):

  • interact: interact with the environment to collect experience; taking one step and taking n steps per call are both supported (see _take_one_step and _take_n_steps, respectively)
  • train: train on a sampled batch
  • exploration_action: choose an action for a given state, with random noise added for exploration during training
  • action: choose an action for a given state during execution
  • value: evaluate the value of a state-action pair
  • evaluation: evaluate the learned agent
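
As an illustration, here is a minimal sketch of such a unified interface. The class name, the buffer attribute, and the method bodies are assumptions for illustration and are not taken from the repository; only the method names follow the list above.

# Minimal sketch of the unified agent interface described above. The class name,
# buffer API, and method bodies are illustrative assumptions, not the repository's code.
import torch


class Agent:
    def __init__(self, env, actor, critic, memory):
        self.env = env          # gym environment
        self.actor = actor      # policy network
        self.critic = critic    # value network
        self.memory = memory    # rollout / replay buffer (hypothetical push() API)
        self.state = env.reset()

    def interact(self, n_steps=1):
        """Collect experience by taking one or n environment steps."""
        for _ in range(n_steps):
            action = self.exploration_action(self.state)
            next_state, reward, done, _ = self.env.step(action)
            self.memory.push(self.state, action, reward, next_state, done)
            self.state = self.env.reset() if done else next_state

    def train(self, batch_size=64):
        """Update the networks on a sampled batch (algorithm-specific)."""
        raise NotImplementedError

    def exploration_action(self, state):
        """Choose an action with exploration noise for training (algorithm-specific)."""
        raise NotImplementedError

    def action(self, state):
        """Choose a greedy/deterministic action for execution (algorithm-specific)."""
        raise NotImplementedError

    def value(self, state, action):
        """Evaluate the value of a state-action pair."""
        with torch.no_grad():
            s = torch.as_tensor(state, dtype=torch.float32)
            a = torch.as_tensor(action, dtype=torch.float32)
            return self.critic(s, a)

    def evaluation(self, n_episodes=10):
        """Run the learned policy without exploration and return the mean episode return."""
        total = 0.0
        for _ in range(n_episodes):
            state, done = self.env.reset(), False
            while not done:
                state, reward, done, _ = self.env.step(self.action(state))
                total += reward
        return total / n_episodes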

Requirements

  • gym
  • python 3.6
  • pytorch

Usage

To train a model:

$ python run_a2c.py
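
A run script presumably alternates between collecting experience and training. A rough outline, using the interface sketched above (the factory function and all hyperparameter values are hypothetical):

# Hypothetical outline of a training loop; make_a2c_agent and the
# hyperparameter values are illustrative, not the repository's actual code.
import gym

env = gym.make("CartPole-v0")
agent = make_a2c_agent(env)          # hypothetical factory building an A2C agent

for episode in range(2000):
    agent.interact(n_steps=10)       # collect a short rollout
    agent.train(batch_size=64)       # one update on the collected experience
    if episode % 100 == 0:
        print(f"episode {episode}: mean return {agent.evaluation(n_episodes=5):.1f}")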

Results

Results for reinforcement learning algorithms are extremely difficult to reproduce. Due to differences in settings such as random seeds and hyperparameters, you may get results that differ from those shown below.
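
Pinning the relevant random seeds reduces, though does not eliminate, this run-to-run variance. A minimal sketch (the seed value and the old-style gym seeding call are assumptions):

# Fix random seeds to reduce run-to-run variance; results may still differ
# across hardware, drivers, and library versions.
import random
import numpy as np
import torch
import gym

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

env = gym.make("CartPole-v0")
env.seed(SEED)   # pre-0.26 gym API, matching the gym versions of this project's era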

  • A2C: CartPole-v0
  • ACKTR: CartPole-v0
  • DDPG: Pendulum-v0
  • DQN: CartPole-v0
  • PPO: CartPole-v0

TODO

  • TRPO
  • LOLA
  • Parameter noise

Acknowledgments

This project draws inspiration from the following projects:

License

MIT


pytorch-drl's Issues

The Actor-Critic Structure in MAA2C

I'm a little confused about your implementation of MAA2C. I don't think the input of the actor network should simply be the "joint state" of the agents. According to [1], the critic's input should be the state of the environment (where the agents' joint state is not necessarily defined) plus the joint action of the agents, i.e., the critic here should be a Q-function over joint actions. The actor, in turn, should be something like a policy, and I don't quite understand why the actor network is implemented this way. An explanation would be appreciated.
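
For reference, the centralized critic described in [1] conditions on the environment state together with all agents' actions and outputs a single Q-value for the joint action. A sketch of such a network (illustrative only, not the repository's MAA2C implementation):

# Sketch of a centralized critic Q(s, a_1, ..., a_N) in the style of [1];
# layer sizes and naming are illustrative assumptions.
import torch
import torch.nn as nn


class CentralizedCritic(nn.Module):
    def __init__(self, state_dim, action_dim, n_agents, hidden=128):
        super().__init__()
        # input: global state concatenated with every agent's action
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim * n_agents, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),   # scalar Q-value of the joint action
        )

    def forward(self, state, joint_actions):
        # state: (batch, state_dim); joint_actions: (batch, n_agents * action_dim)
        return self.net(torch.cat([state, joint_actions], dim=-1))

Each actor, by contrast, would typically condition only on its own local observation.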

License?

Hi,

What is the license?

Hugh

About the computation of Advantage and State Value in PPO

In your implementation of the critic, you feed the network the observation and the action, and it outputs a 1-dimensional value. Can I infer that this is Q(s, a)?
But the advantage you compute is

values = self.critic_target(states_var, actions_var).detach()
advantages = rewards_var - values

which is the estimate of q_t minus Q(s_t, a). I think it should be Advantage = q_t - V(s_t).
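
For comparison, with a state-value critic V(s) the advantage the issue suggests would be estimated roughly as follows (function and variable names are illustrative; q_t is taken to be an empirical discounted return):

# Sketch of the suggested advantage estimate A_t = q_t - V(s_t); illustrative only.
import torch

def compute_advantages(critic_v, states, returns):
    # critic_v: network mapping states -> V(s), output shape (batch, 1)
    # states:   (batch, state_dim) tensor of states s_t
    # returns:  (batch,) tensor of discounted returns q_t
    with torch.no_grad():
        values = critic_v(states).squeeze(-1)   # V(s_t), detached from the graph
    return returns - values                     # A_t = q_t - V(s_t)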
