Basic RL Algorithms Implementations

Starcraft 2 Multiple Agents Results with PPO (https://github.com/oxwhirl/smac)
Every agent was controlled independently and has restricted information
All the environments were trained with a default difficulty level 7
No curriculum, just baseline PPO
Full state information wasn't used for critic, actor and critic recieved the same agent observations
Most results are significantly better by win rate and were trained on a single PC much faster than QMIX (https://arxiv.org/pdf/1902.04043.pdf), MAVEN (https://arxiv.org/pdf/1910.07483.pdf) or QTRAN
No hyperparameter search
4 frames + conv1d actor-critic network
Miniepoch num was set to 1, higher numbers didn't work
Simple MLP networks didnot work good on hard envs

How to run configs:

Pytorch

python runner.py --train --file rl_games/configs/smac/3m_torch.yaml
python runner.py --play --file rl_games/configs/smac/3m_torch.yaml --checkpoint 'nn/3m_cnn'

Tensorflow

python runner.py --tf --train --file rl_games/configs/smac/3m_torch.yaml
python runner.py --tf --play --file rl_games/configs/smac/3m_torch.yaml --checkpoint 'nn/3m_cnn'
tensorboard --logdir runs

Results on some environments:

2m_vs_1z took near 2 minutes to achive 100% WR
corridor took near 2 hours for 95+% WR
MMM2 4 hours for 90+% WR
6h_vs_8z got 82% WR after 8 hours of training
5m_vs_6m got 72% WR after 8 hours of training

Plots:

FPS in these plots is calculated on per env basis except MMM2 (it was scaled by number of agents which is 10), to get a win rate per number of environmental steps info, the same as used in plots in QMIX, MAVEN, QTRAN or Deep Coordination Graphs (https://arxiv.org/pdf/1910.00091.pdf) papers FPS numbers under the horizontal axis should be devided by number of agents in player's team.

2m_vs_1z:
3s5z_vs_3s6z:
3s_vs_5z:
corridor:
5m_vs_6m:
MMM2:

Link to the continuous results

Currently Implemented:

DQN
Double DQN
Dueling DQN
Noisy DQN
N-Step DQN
Categorical
Rainbow DQN
A2C
PPO

Tensorflow implementations of the DQN atari.

Double dueling DQN vs DQN with the same parameters

Near 90 minutes to learn with this setup.

Different DQN Configurations tests

Light grey is noisy 1-step dddqn. Noisy 3-step dddqn was even faster. Best network (configuration 5) needs near 20 minutes to learn, on NVIDIA 1080. Currently the best setup for pong is noisy 3-step double dueling network. In pong_runs.py different experiments could be found. Less then 200k frames to take score > 18. DQN has more optimistic Q value estimations.

Other Games Results

This results are not stable. Just best games, for good average results you need to train network more then 10 million steps. Some games need 50m steps.

5 million frames two step noisy double dueling dqn:

Random lucky game in Space Invaders after less then one hour learning:

A2C and PPO Results

More than 2 hours for Pong to achieve 20 score with one actor playing.
8 Hours for Supermario lvl1

PPO with LSTM layers

allensmile / rl_games Goto Github PK

rl_games's Introduction

Basic RL Algorithms Implementations

How to run configs:

Pytorch

Tensorflow

Results on some environments:

Plots:

Other Games Results

A2C and PPO Results

rl_games's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs