
dgriff777 / rl_a3c_pytorch

554 stars · 20 watchers · 117 forks · 101.77 MB

A3C LSTM Atari with Pytorch plus A3G design

License: Apache License 2.0

Python 100.00%
python pytorch pytorch-a3c reinforcement-learning atari openai-gym a3c deep-reinforcement-learning actor-critic asynchronous-advantage-actor-critic


rl_a3c_pytorch's Issues

synchronization among workers

Browsing the code, I can't help but notice that there is no synchronization among workers, i.e., no Lock mechanism to coordinate the updating of shared_model by different workers. Is this how the "hogwild" algorithm works? I have browsed several PyTorch implementations of A3C, and all seem to share the same model-updating mechanism. Here are my specific questions; I wonder if you can enlighten me or confirm my understanding:

  1. In setting the grad for shared_model (see the function below), why do you check "if shared_param.grad is not None"? Is this meant to prevent one worker from overstepping another? After the first assignment shared_param._grad = param.grad, shared_param.grad will never be None again, because neither optimizer.zero_grad() nor optimizer.step() resets shared_param.grad to None. Wouldn't this prevent shared_param._grad from ever being set to param.grad again within the same worker?

def ensure_shared_grads(model, shared_model):
    for param, shared_param in zip(model.parameters(),
                                   shared_model.parameters()):
        if shared_param.grad is not None:
            return
        shared_param._grad = param.grad

  2. Assuming my understanding of the above code is wrong and shared_model keeps getting new grads from worker A: it is still possible that, before optim.step() is executed in worker A, worker B has also updated the grads of shared_model (partially or completely). So by the time optim.step() finishes, the grad used to update the parameters could come from worker A, from worker B, or be a mix of both. Is that true?

  3. If the above is true, then this way of updating the model parameters seems very inefficient. The original A3C paper seems to describe periodic synchronization rather than syncing at all times, which might improve stability and convergence speed. Just a thought (a lock-based sketch is included below).
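For illustration, here is a minimal sketch of the lock-based alternative raised in point 3. It is not how this repo works (the repo relies on lock-free, hogwild-style updates); the locked_update helper and its arguments are hypothetical.

import torch.multiprocessing as mp

def locked_update(model, shared_model, optimizer, lock):
    # Hypothetical: serialize the copy-grads-then-step sequence so that only
    # one worker at a time touches the shared model.
    with lock:
        for param, shared_param in zip(model.parameters(),
                                       shared_model.parameters()):
            shared_param._grad = param.grad
        optimizer.step()

# A single lock (e.g. lock = mp.Lock()) would be created in the main process
# and passed to every worker.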

Thank you.

How long have you trained the model

Cool project! But how long did you train each Atari game? I have trained SpaceInvader-v0 for 13 hours with 16 CPUs, but the reward is still around 790, whereas according to the original paper the performance had reached 1400 at the 14-hour mark. How can I train faster?
Another question is that the network architecture differs from the one proposed in the original paper. Does that affect performance? Thank you!

Pretrained models

Hello, is it possible to get access to some of the pre-trained models? (I'm specifically looking for Seaquest, Pong and Space Invaders, but any or all would be brilliant.)

Question on training function

I noticed that in the action_train function of your player_util.py:

if self.done:
    if self.gpu_id >= 0:
        with torch.cuda.device(self.gpu_id):
            self.cx = Variable(torch.zeros(1, 512).cuda())
            self.hx = Variable(torch.zeros(1, 512).cuda())
    else:
        self.cx = Variable(torch.zeros(1, 512))
        self.hx = Variable(torch.zeros(1, 512))
else:
    self.cx = Variable(self.cx.data)
    self.hx = Variable(self.hx.data)

But how can you backpropagate gradients through time, to the past 20 steps, if you set:

self.cx = Variable(self.cx.data)
self.hx = Variable(self.hx.data)
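For comparison, here is the same hidden-state handling rewritten as a fragment in current PyTorch, where Variable is deprecated; this is an assumption-laden rewrite, not the repo's code. Detaching at the start of each rollout truncates backpropagation through time at the rollout boundary, while gradients still flow through the up-to-20 steps collected within the current rollout.

import torch

if self.done:
    device = torch.device(f"cuda:{self.gpu_id}" if self.gpu_id >= 0 else "cpu")
    # Fresh zero hidden/cell states at an episode boundary.
    self.cx = torch.zeros(1, 512, device=device)
    self.hx = torch.zeros(1, 512, device=device)
else:
    # Keep the values but cut the graph: BPTT stops at the rollout boundary.
    self.cx = self.cx.detach()
    self.hx = self.hx.detach()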

watching game visualization during training

I looked through all the arguments and I don't see one that lets you watch the game being played during training. I guess I must be missing something really obvious. Great work, by the way!

Clarification needed regarding num_workers

@dgriff777 thanks again for providing this amazing repo. I was wondering whether num_workers should be equal to the number of threads instead of the number of cores suggested in README.md. I'm new to A3C, so please bear with me if this is a naive question. :P

UserWarning: This overload of add_ is deprecated

Is it normal to get this traceback in the console? It spams a few dozen times and then stops abruptly; then the training session starts logging as intended. Sorry if this question is incredibly ignorant, lol. I'm new to Python and the world of AI; figured I'd post my question here before searching Google. Thanks in advance.
C:\Users\joshu\Documents\0AIFolder\00A3CA3Gatari\rl_a3c_pytorch-master\shared_optim.py:167: UserWarning: This overload of add_ is deprecated:
    add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
    add_(Tensor other, *, Number alpha) (Triggered internally at ..\torch\csrc\utils\python_arg_parser.cpp:882.)
  exp_avg.mul_(beta1).add_(1 - beta1, grad)
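For reference, the warning only concerns the argument order of add_; a minimal sketch of the newer signature (beta1, grad and exp_avg below are stand-ins, not the actual variables from shared_optim.py):

import torch

beta1 = 0.9
grad = torch.randn(3)
exp_avg = torch.zeros(3)

# Deprecated form flagged above: exp_avg.mul_(beta1).add_(1 - beta1, grad)
# Newer signature: pass the tensor first and the scalar as the keyword alpha.
exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)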

Visualization not appearing

I ran python main.py --env Pong-v0 --workers 32 and
python gym_eval.py --env Pong-v0 --num-episodes 100, but I don't see any visualization of the game. Can I turn it on somehow?
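As a hypothetical standalone example (rendering is not a built-in flag of main.py or gym_eval.py as far as this page shows), gym environments of that era can be displayed with env.render():

import gym

env = gym.make("Pong-v0")
obs = env.reset()
done = False
while not done:
    env.render()                                   # opens a window showing the game
    obs, reward, done, info = env.step(env.action_space.sample())
env.close()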

Need for trained models

Hi,

Due to a lack of resources, I can't train the models myself, so I need the pre-trained models for the various games. Is it possible for you to share them with me? It would be very helpful.

Thanks

NotImplementedError

Slightly baffled by this:

Traceback (most recent call last):
  File "/home/bob/anaconda3/envs/rl_env/lib/python3.5/multiprocessing/process.py", line 252, in _bootstrap
    self.run()
  File "/home/bob/anaconda3/envs/rl_env/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/bob/PycharmProjects/rl_a3c_pytorch-master/train.py", line 31, in train
    player.state = player.env.reset()
  File "/home/bob/anaconda3/envs/rl_env/lib/python3.5/site-packages/gym/core.py", line 123, in reset
    observation = self._reset()
  File "/home/bob/anaconda3/envs/rl_env/lib/python3.5/site-packages/gym/core.py", line 375, in _reset
    observation = self.env.reset()
  File "/home/bob/anaconda3/envs/rl_env/lib/python3.5/site-packages/gym/core.py", line 123, in reset
    observation = self._reset()
  File "/home/bob/anaconda3/envs/rl_env/lib/python3.5/site-packages/gym/core.py", line 376, in _reset
    return self._observation(observation)
  File "/home/bob/anaconda3/envs/rl_env/lib/python3.5/site-packages/gym/core.py", line 386, in _observation
    raise NotImplementedError
NotImplementedError

Reward is always 0 when training Breakout-v0

I trained the model overnight on Breakout-v0, but the reward is always 0. What could cause this? Or could you tell me what parameters you used when training it to play Breakout-v0? Thank you. Here is the log file:
log.txt

Move trained_models into release

The model files are rather large (understandably), so it's quite a pain to clone the repo for local use.
One reasonable way around this is to archive them and make them available as a release file instead.

Running A3C on 8 CPUs is still slow

When I run A3C on 8 CPUs, it is still slow.
My CPU is a Xeon(R) Platinum 8255C. Is the cause my CPU's poor performance, or a problem with torch multiprocessing?

Reward Smoothing

Hi, what do you think about reward smoothing?
The collected rewards have high variance, so in order to show the trend of the reward curve, should we apply some reward smoothing, the same as TensorBoard's smoothing?
If so, which smoothing method should I choose: exponential smoothing or (sliding-window) average smoothing?
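As an illustration of the two options mentioned, a small sketch (assuming rewards is a plain list of per-episode rewards):

import numpy as np

def exponential_smooth(rewards, weight=0.9):
    # TensorBoard-style exponential moving average.
    smoothed, last = [], rewards[0]
    for r in rewards:
        last = weight * last + (1 - weight) * r
        smoothed.append(last)
    return smoothed

def moving_average(rewards, window=100):
    # Simple sliding-window (average) smoothing.
    return np.convolve(rewards, np.ones(window) / window, mode="valid")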

cnn layers

In your model you have 4 CNN layers plus max pooling (a rough sketch of such an encoder is given below).

  1. DQN (2015) used only 3 CNN layers, without pooling.
  2. A3C (2016) used only 2 CNN layers, without pooling.

questions:

  1. Don't you think pooling actually loses spatial information about the RL scene, which IMO is important? Why did you decide to use pooling instead of increasing the stride to 2?
  2. Why did you decide to use 4 CNN layers? Are the gym v0 environments possibly harder?
  3. Any particular reason for such a specific weight init (final actor/critic linear weights)?

thanks.
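Purely as an illustration of the architecture being discussed, a 4-conv-plus-max-pool encoder of the general shape described above; the layer sizes and the 80x80 grayscale input are placeholders chosen for this sketch, not values read from model.py:

import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(1, 32, 5, stride=1, padding=2), nn.MaxPool2d(2, 2), nn.ReLU(),
    nn.Conv2d(32, 32, 5, stride=1, padding=1), nn.MaxPool2d(2, 2), nn.ReLU(),
    nn.Conv2d(32, 64, 4, stride=1, padding=1), nn.MaxPool2d(2, 2), nn.ReLU(),
    nn.Conv2d(64, 64, 3, stride=1, padding=1), nn.MaxPool2d(2, 2), nn.ReLU(),
)
x = torch.zeros(1, 1, 80, 80)   # one preprocessed 80x80 frame (illustrative)
print(encoder(x).shape)         # feature map that would feed the LSTM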

Seaquest-v0 not training as well as announced

I have tried twice to train the agent on Seaquest-v0 with 32 workers on a server, but after 13 hours of training the score seems to be stuck at a maximum of 2700/2800.

Here's the log file :
log.txt

I'm using gym 0.8.1 and atari-py 0.0.21 and left all the hyperparameters at their default values.
Any idea why the score obtained is much lower than the one you obtained (>50000)?
Would you have the trained model for Seaquest-v0?
Thanks!

Why ensure_shared_grads

In your ensure_shared_grads function:

if shared_param.grad is not None and not gpu:
    return

Can you explain what this means?
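One commonly cited reading of that check, written out as comments. This is an illustrative reconstruction, not a verbatim copy of the repo's function:

def ensure_shared_grads(model, shared_model, gpu=False):
    for param, shared_param in zip(model.parameters(),
                                   shared_model.parameters()):
        if shared_param.grad is not None and not gpu:
            # CPU case: after the first call, shared_param.grad already points
            # at this worker's param.grad tensor, which backward() keeps
            # updating in place, so re-assigning it would be redundant.
            return
        # First CPU call, or GPU case where the gradients live on the device
        # and must be copied to the shared (CPU) model on every update.
        shared_param._grad = param.grad if not gpu else param.grad.cpu()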

question about trained models

Just want to clarify: there is only one saved model per environment, and it will be overwritten each training epoch, right? For example, MsPacman will only have one saved model, trained_models/MsPacman-v0.dat.

Performance of Breakout

Could I ask how long it takes to train Breakout from scratch to reach the reported score (859.57 for Breakout-v3)?

Have you tried BreakoutNoFrameskip? This is the version without the (randomized) frame-skip repetition.

Thanks!

multi gpu support

When I run the program on multiple GPUs, i.e., I set gpu_id to [0, 1, 2, 3], it reports the error "Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one".

Why does one process run on 2 GPUs?

First, thank you for your great work on this A3C implementation.

I ran the code with python main.py --workers 1 --gpu-ids 5 and found that the process runs on 2 GPUs. Something similar happens when I run with --workers 50: all of the processes should run on GPU 5, yet each of them (same PID) also shows up on GPU 0 with Type C and a smaller GPU memory usage than on GPU 5. How can I assign all the processes to GPU 5 only? Thank you very much!
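A hypothetical workaround (not an option of the repo): hide the other GPUs from the process before CUDA is initialized, so the intended card is the only one visible and no context gets created on GPU 0. The environment variable could equally be set in the shell when launching main.py; --gpu-ids would then refer to the re-indexed device 0.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "5"   # must be set before CUDA is initialized

import torch
print(torch.cuda.device_count())            # 1: the selected card now appears as cuda:0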


Stuck when training in MsPacman-v0

Hi @dgriff777, thank you for your repo. It's great that it can achieve such high scores, but I ran into a problem when trying to apply it to MsPacman-v0.

I simply used this command: python main.py --env MsPacman-v0 --workers 7
Then I get test scores like this:

2018-10-27 15:59:44,767 : lr: 0.0001
2018-10-27 15:59:44,767 : gamma: 0.99
2018-10-27 15:59:44,767 : tau: 1.0
2018-10-27 15:59:44,767 : seed: 1
2018-10-27 15:59:44,767 : workers: 7
2018-10-27 15:59:44,767 : num_steps: 20
2018-10-27 15:59:44,767 : max_episode_length: 10000
2018-10-27 15:59:44,767 : env: MsPacman-v0
2018-10-27 15:59:44,767 : env_config: config.json
2018-10-27 15:59:44,767 : shared_optimizer: True
2018-10-27 15:59:44,767 : load: False
2018-10-27 15:59:44,767 : save_max: True
2018-10-27 15:59:44,767 : optimizer: Adam
2018-10-27 15:59:44,767 : load_model_dir: trained_models/
2018-10-27 15:59:44,767 : save_model_dir: trained_models/
2018-10-27 15:59:44,767 : log_dir: logs/
2018-10-27 15:59:44,767 : gpu_ids: [-1]
2018-10-27 15:59:44,767 : amsgrad: True
2018-10-27 15:59:44,767 : skip_rate: 4
2018-10-27 15:59:52,746 : Time 00h 00m 07s, episode reward 60.0, episode length 429, reward mean 60.0000
2018-10-27 16:00:17,886 : Time 00h 00m 32s, episode reward 70.0, episode length 619, reward mean 65.0000
2018-10-27 16:00:43,513 : Time 00h 00m 58s, episode reward 70.0, episode length 628, reward mean 66.6667
2018-10-27 16:01:09,034 : Time 00h 01m 24s, episode reward 70.0, episode length 633, reward mean 67.5000
2018-10-27 16:01:34,687 : Time 00h 01m 49s, episode reward 70.0, episode length 615, reward mean 68.0000
2018-10-27 16:02:00,366 : Time 00h 02m 15s, episode reward 70.0, episode length 641, reward mean 68.3333
2018-10-27 16:02:25,238 : Time 00h 02m 40s, episode reward 70.0, episode length 624, reward mean 68.5714
2018-10-27 16:02:50,496 : Time 00h 03m 05s, episode reward 70.0, episode length 622, reward mean 68.7500
2018-10-27 16:03:15,714 : Time 00h 03m 30s, episode reward 70.0, episode length 631, reward mean 68.8889
2018-10-27 16:03:40,280 : Time 00h 03m 55s, episode reward 70.0, episode length 626, reward mean 69.0000
2018-10-27 16:04:05,072 : Time 00h 04m 20s, episode reward 70.0, episode length 628, reward mean 69.0909

The test score is always 70, and it seems the agent takes the same path every time and stops in a corner.

Could you tell me how you trained the model to get the 6323.01 ± 116.91 score on MsPacman-v0? Are there any other parameters I should set?

I just want to say your trained model has no effect

I tried to evaluate your trained model, but it does not seem to work at all:

2017-08-01 21:08:13,757 : reward sum: -21.0, reward mean: -21.0000
[2017-08-01 21:08:13,757] reward sum: -21.0, reward mean: -21.0000
[2017-08-01 21:08:13,787] Starting new video recorder writing to /Volumes/xs/CodeSpace/AISpace/rl_space/rl_a3c_pytorch/Pong-v0_monitor/openaigym.video.0.33472.video000001.mp4
2017-08-01 21:08:24,947 : reward sum: -21.0, reward mean: -21.0000
[2017-08-01 21:08:24,947] reward sum: -21.0, reward mean: -21.0000
2017-08-01 21:08:35,054 : reward sum: -21.0, reward mean: -21.0000
[2017-08-01 21:08:35,054] reward sum: -21.0, reward mean: -21.0000
2017-08-01 21:08:44,732 : reward sum: -21.0, reward mean: -21.0000
[2017-08-01 21:08:44,732] reward sum: -21.0, reward mean: -21.0000

Also, the recording is a black-and-white video, and nothing is shown on screen during evaluation.

Question on max-length

Hi,

I think there is a problem with the max episode length. max_length is 20000 in your default setting, but gym's internal max episode length is 10000.

In testing, when the number of steps == 10000, player.done is True while player.max_length is not. It's possible that player.info['ale.lives'] > 0 at this time, so the condition
if player.done and player.info['ale.lives'] > 0 and not player.max_length:
is satisfied.

Under that condition, you reset the environment. Now in EpisodicLifeEnv (environment.py), self.was_real_done is True and you actually reset the gym environment (which is correct). But your code doesn't treat this 10000-step episode as a terminated episode; instead, it assumes the episode didn't terminate because player.info['ale.lives'] > 0.

Train a new game

If I want to train on a new game, how should I choose the initial parameters to start tuning? Any suggestions?

Solving time

Thank you for the nice implementation. I'm curious about the running time on your machine. In https://github.com/ikostrikov/pytorch-a3c, it is reported that PongDeterministic-v3 is solved in around 15 minutes; did you reproduce similar results on any version of Pong?

Thank you

Hyperparameters for training

Hi,
Thanks for your work. I am wondering how you trained Seaquest to achieve such high performance; I always get stuck at around 4000 points. Could you please share the hyperparameters?

eps for Adam

Is there a reason why the default eps for the Adam optimizer is so high? Currently it is 1e-3 (line 104 in shared_optim.py); usually it's around 1e-8. Just wanted to check whether this was intentional (e.g., it works better than a lower value) or not.

plot rewards as a function of number of timesteps

Hi, thanks so much for the excellent codebase. Just wondering, is there any way to plot the training curve as a function of timesteps (as opposed to plotting it as a function of wall-clock time)?

Thanks!
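A hedged post-processing sketch, not a feature of the repo: the test log lines shown elsewhere on this page contain both the episode reward and the episode length, so cumulative timesteps can be approximated by summing the logged lengths. The log file path below is hypothetical.

import re
import matplotlib.pyplot as plt

pattern = re.compile(r"episode reward ([-\d.]+), episode length (\d+)")
steps, rewards, total = [], [], 0
with open("logs/Pong-v0_log") as f:        # hypothetical log path
    for line in f:
        m = pattern.search(line)
        if m:
            total += int(m.group(2))       # running sum of episode lengths
            steps.append(total)
            rewards.append(float(m.group(1)))

plt.plot(steps, rewards)
plt.xlabel("timesteps (cumulative episode length)")
plt.ylabel("episode reward")
plt.show()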

Quick question on batch processing

Thanks for the implementation! Great code.

I read and ran your code and realized it processes training examples with batch_size=1 (instead of a larger batch size; am I correct about this?). I am just wondering whether this is by design because of your G-A3C. With a larger batch size things run faster on the GPU, so why batch_size=1?

Is there anything we can do to run it on large batches?

Thank you.

Normalize

In NormalizedEnv, I am puzzled why you chose alpha equal to 0.9999; if I want an unbiased mean, shouldn't alpha be (num_step - 1)/num_step?

I am also puzzled why normalizing the environment observations has such a huge influence on performance. Can you explain? Thanks!
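For reference, a sketch of the kind of running, bias-corrected observation normalization being discussed; this is an illustrative reconstruction, not a verbatim copy of NormalizedEnv in environment.py:

import numpy as np

class RunningNormalize:
    def __init__(self, alpha=0.9999):
        self.alpha = alpha
        self.num_steps = 0
        self.state_mean = 0.0
        self.state_std = 0.0

    def __call__(self, observation):
        observation = np.asarray(observation, dtype=np.float64)
        self.num_steps += 1
        self.state_mean = self.state_mean * self.alpha + \
            observation.mean() * (1 - self.alpha)
        self.state_std = self.state_std * self.alpha + \
            observation.std() * (1 - self.alpha)
        # Bias correction (as in Adam): dividing by 1 - alpha**t keeps the
        # running estimates unbiased while num_steps is still small.
        unbiased_mean = self.state_mean / (1 - self.alpha ** self.num_steps)
        unbiased_std = self.state_std / (1 - self.alpha ** self.num_steps)
        return (observation - unbiased_mean) / (unbiased_std + 1e-8)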

Question about Test function

Hi, I'd like to ask a question about the following block:

if player.done and not player.info:
    state = player.env.reset()
    player.eps_len += 2
    player.state = torch.from_numpy(state).float()
    if gpu_id >= 0:
        with torch.cuda.device(gpu_id):
            player.state = player.state.cuda()
elif player.info:

I don't quite understand when info is True or False; what is the meaning of having info=True versus info=False?

I can't seem to find any documentation about this info flag on the Gym website. :(

Thanks
