dgriff777 / rl_a3c_pytorch
A3C LSTM Atari with Pytorch plus A3G design
License: Apache License 2.0
What do the following variables stand for?
player.flag, player.starter, player.current_life, player.info['ale.lives']
Thanks
Browsing the code, I can't help noticing there is no synchronization among workers, i.e., no Lock mechanism to coordinate the updating of shared_model by different workers. Is this how the "hogwild" algorithm works? I have browsed several PyTorch implementations of A3C, and all seem to share the same model-updating mechanism. Here are my specific questions; I wonder if you can enlighten me or confirm my understanding:
def ensure_shared_grads(model, shared_model):
    for param, shared_param in zip(model.parameters(),
                                   shared_model.parameters()):
        if shared_param.grad is not None:
            return
        shared_param._grad = param.grad
Unless my understanding of the above code is wrong, shared_model keeps getting its new grad from worker A. However, it is possible that before optim.step() is executed in worker A, worker B will have updated the grad for shared_model too (partially or completely). So by the time optim.step() finishes, the grad used to update the params could come from worker A, from worker B, or from a mix of both. Is that true?
If my statement above is true, then this way of updating the model params seems very inefficient. The original A3C paper seems to mention periodic synchronization rather than syncing at all times. That might help increase stability and convergence speed. Just a thought.
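For reference, a minimal sketch of how such a lock-free (Hogwild-style) hand-off could look; local_model, shared_model, shared_optimizer and rollout_loss are hypothetical stand-ins, not the repo's exact objects:

def worker_step(local_model, shared_model, shared_optimizer, rollout_loss):
    # Compute gradients locally for one 20-step rollout.
    local_model.zero_grad()
    rollout_loss.backward()
    # Hand the local gradients to the shared model without any lock. Another
    # worker may do the same before step() runs, so the applied gradient can
    # come from worker A, worker B, or a mix of both -- the Hogwild trade-off.
    for param, shared_param in zip(local_model.parameters(),
                                   shared_model.parameters()):
        if shared_param.grad is None:
            shared_param._grad = param.grad
    # shared_optimizer is assumed to have been built over shared_model.parameters().
    shared_optimizer.step()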
Thank you.
Cool project! But how long did you train for each Atari game? I have trained SpaceInvader-v0 for 13 hours with 16 CPUs, but the reward is still around 790. However, according to the original paper, at 14 hours the performance had already reached 1400. How can I train faster?
Another problem is that the network architecture is different from what the original paper proposed. Does that affect performance? Thank you!
Hello, is it possible to get access to some of the pre-trained models? (Specifically looking for Seaquest, Pong and Space Invaders, but any or all would be brilliant.)
I noticed that in your player_util.py action_train function:
if self.done:
    if self.gpu_id >= 0:
        with torch.cuda.device(self.gpu_id):
            self.cx = Variable(torch.zeros(1, 512).cuda())
            self.hx = Variable(torch.zeros(1, 512).cuda())
    else:
        self.cx = Variable(torch.zeros(1, 512))
        self.hx = Variable(torch.zeros(1, 512))
else:
    self.cx = Variable(self.cx.data)
    self.hx = Variable(self.hx.data)
But how can you backpropagate gradients through time over the past 20 steps if you set:
self.cx = Variable(self.cx.data)
self.hx = Variable(self.hx.data)
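For context, a small self-contained sketch (modern PyTorch, so no Variable) of truncated BPTT: the hidden state is detached only at rollout boundaries, so gradients still flow through all 20 steps inside a rollout.

import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=32, hidden_size=512)
hx = torch.zeros(1, 512)
cx = torch.zeros(1, 512)

for rollout in range(3):
    # Cut the graph only between rollouts (the role of Variable(self.hx.data)).
    hx, cx = hx.detach(), cx.detach()
    outputs = []
    for step in range(20):
        x = torch.randn(1, 32)          # stand-in for the conv features
        hx, cx = cell(x, (hx, cx))
        outputs.append(hx)
    loss = torch.stack(outputs).sum()
    loss.backward()                     # backprops through all 20 steps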
Usually max_pool2d is applied after each (relu) activation, but you apply max_pool2d before each activation. Did you try both ways and one worked better?
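As a quick sanity check on that ordering question: because ReLU is monotone, it commutes with max pooling, so the two orderings give identical outputs and differ only in how many elements the activation touches. A minimal check with assumed tensor sizes, not the repo's model:

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
a = F.relu(F.max_pool2d(x, 2))   # pool first, then activate
b = F.max_pool2d(F.relu(x), 2)   # the more common ordering
print(torch.allclose(a, b))      # True: ReLU commutes with max pooling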
I looked through all the arguments and I don't see one that allows you to watch the game being played while training. I guess I must be missing something really obvious. Great work by the way!
@dgriff777 thanks again for providing this amazing repo. I was wondering if num_workers should be equal to the number of threads instead of the number of cores, as suggested in README.md. I'm new to A3C, so please bear with me if this is a naive question :P
Is it normal to get this traceback in the console? It spams a few dozen times and then stops abruptly. Then it starts logging the training session as intended. Sorry if this question is incredibly ignorant, lol. I'm new to Python and the world of AI. Figured I'd post my question here before searching Google. Thanks in advance.
C:\Users\joshu\Documents\0AIFolder\00A3CA3Gatari\rl_a3c_pytorch-master\shared_optim.py:167: UserWarning: This overload of add_ is deprecated:
    add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
    add_(Tensor other, *, Number alpha)
(Triggered internally at ..\torch\csrc\utils\python_arg_parser.cpp:882.)
    exp_avg.mul_(beta1).add_(1 - beta1, grad)
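That warning is harmless; it only flags the older positional add_ signature. Assuming line 167 of shared_optim.py is the exponential-moving-average update quoted at the end of the warning, the keyword form below is equivalent and silences it:

import torch

# Stand-in tensors for the optimizer state referenced in the warning.
exp_avg = torch.zeros(3)
grad = torch.ones(3)
beta1 = 0.9

# Deprecated form quoted by the warning:
#     exp_avg.mul_(beta1).add_(1 - beta1, grad)
# Equivalent modern form (tensor first, scale passed as a keyword):
exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)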
I ran python main.py --env Pong-v0 --workers 32 and python gym_eval.py --env Pong-v0 --num-episodes 100, but I don't see any visualization of the game. Can I turn it on somehow?
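If the evaluation script does not expose a rendering option, a generic gym-era way to watch an agent is to call env.render() inside the step loop; the sketch below uses a random policy just to stay self-contained and is not the repo's code:

import gym

env = gym.make("Pong-v0")
obs = env.reset()
done = False
while not done:
    env.render()                                   # pops up a game window
    obs, reward, done, info = env.step(env.action_space.sample())
env.close()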
Why is this in train.py?
if args.count_lives:
    if lives > info['ale.lives']:
        done = True
Hi,
Due to a lack of resources, I can't train the models myself. Therefore, I need the pre-trained models for the various games. Is it possible for you to share the models with me? It would be very helpful.
Thanks
Slightly baffled by this:
Traceback (most recent call last):
  File "/home/bob/anaconda3/envs/rl_env/lib/python3.5/multiprocessing/process.py", line 252, in _bootstrap
    self.run()
  File "/home/bob/anaconda3/envs/rl_env/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/bob/PycharmProjects/rl_a3c_pytorch-master/train.py", line 31, in train
    player.state = player.env.reset()
  File "/home/bob/anaconda3/envs/rl_env/lib/python3.5/site-packages/gym/core.py", line 123, in reset
    observation = self._reset()
  File "/home/bob/anaconda3/envs/rl_env/lib/python3.5/site-packages/gym/core.py", line 375, in _reset
    observation = self.env.reset()
  File "/home/bob/anaconda3/envs/rl_env/lib/python3.5/site-packages/gym/core.py", line 123, in reset
    observation = self._reset()
  File "/home/bob/anaconda3/envs/rl_env/lib/python3.5/site-packages/gym/core.py", line 376, in _reset
    return self._observation(observation)
  File "/home/bob/anaconda3/envs/rl_env/lib/python3.5/site-packages/gym/core.py", line 386, in _observation
    raise NotImplementedError
NotImplementedError
I have trained the model overnight on Breakout-v0, but the reward is always 0. What could cause this? Or could you tell me what parameters you used when training to play Breakout-v0? Thank you. Here is the log file.
log.txt
Model files are rather large (understandably), so it's quite a pain to clone the repo for local use.
One reasonable way around this is to archive them and make them available as a release file instead.
When I run A3C on 8 CPUs, it is still slow.
My CPU is a Xeon(R) Platinum 8255C. Is my poor CPU performance the reason, or is it a torch multiprocessing problem?
Hi, what do you think about reward smoothing?
The collected rewards have high variance. In order to show the trend of the reward curve, should we apply a smoothing operation like TensorBoard's smoothing?
If so, which smoothing method should I choose, exponential smoothing or moving-average smoothing?
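For what it's worth, TensorBoard's smoothing is just an exponential moving average over the logged values; a small sketch, where rewards is a hypothetical list read from the log file:

def exponential_smooth(rewards, weight=0.9):
    # weight=0.9 roughly matches TensorBoard's default smoothing slider.
    smoothed, last = [], rewards[0]
    for r in rewards:
        last = weight * last + (1 - weight) * r
        smoothed.append(last)
    return smoothed

print(exponential_smooth([60.0, 70.0, 70.0, 120.0, 200.0]))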
In your model you have 4 CNN layers and max pooling.
questions:
thanks.
I have tried twice to train the agent on Seaquest-v0 with 32 workers on a server, but after 13 hours of training, the score seems to be stuck at 2700/2800 maximum.
Here's the log file :
log.txt
I'm using gym 0.8.1 and Atari-py 0.0.21 and left all the hyperparameters at their default values.
Any idea why the score obtained is much lower than the one you obtained? (>50000)
Would you have the trained model for Seaquest-v0 ?
Thanks !
As far as I can see, model hyperparameters are different.
Thanks.
In your ensure_shared_grads function:
    if shared_param.grad is not None and not gpu:
        return
Can you explain what this means?
Just want to clarify that there is only one saved model per environment and it will be overwritten each training epoch, right? For example, MsPacman will only have one saved model trained_models/MsPacman-v0.dat
Could I ask how long it takes to train Breakout from scratch to get the desired score (859.57 for Breakout-v3)?
Have you tried BreakoutNoFrameskip? That is a version without frame repetition and randomness.
Thanks!
When I run the program on multiple GPUs, that is, with gpu_id set to [0, 1, 2, 3], it reports the error "Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one".
First, thank you for your great work on this A3C implementation.
I run the code with python main.py --workers 1 --gpu-ids 5 and find that one process runs on 2 GPUs. Similar things happen when I run with --workers 50. All the processes should run on GPU 5. However, I find that all of these processes (same PIDs) also appear on GPU 0 with Type C and a smaller GPU Memory Usage compared with those on GPU 5. How can I assign all the processes to GPU 5? Thank you very much!
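One hedged workaround (not from the repo) for the stray context on GPU 0: restrict which device the process can see before torch initializes CUDA, so any unguarded .cuda() call cannot land on GPU 0.

import os

# Must run before the first CUDA call; afterwards the physical GPU 5 shows up
# as cuda:0 inside this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "5"

import torch

print(torch.cuda.device_count())  # 1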
Hi @dgriff777. Thank you for your repo. It's great that it can achieve such a high score. But I ran into a problem when I tried to apply it to MsPacman-v0.
I simply used the command python main.py --env MsPacman-v0 --workers 7.
Then I get test scores like this:
2018-10-27 15:59:44,767 : lr: 0.0001
2018-10-27 15:59:44,767 : gamma: 0.99
2018-10-27 15:59:44,767 : tau: 1.0
2018-10-27 15:59:44,767 : seed: 1
2018-10-27 15:59:44,767 : workers: 7
2018-10-27 15:59:44,767 : num_steps: 20
2018-10-27 15:59:44,767 : max_episode_length: 10000
2018-10-27 15:59:44,767 : env: MsPacman-v0
2018-10-27 15:59:44,767 : env_config: config.json
2018-10-27 15:59:44,767 : shared_optimizer: True
2018-10-27 15:59:44,767 : load: False
2018-10-27 15:59:44,767 : save_max: True
2018-10-27 15:59:44,767 : optimizer: Adam
2018-10-27 15:59:44,767 : load_model_dir: trained_models/
2018-10-27 15:59:44,767 : save_model_dir: trained_models/
2018-10-27 15:59:44,767 : log_dir: logs/
2018-10-27 15:59:44,767 : gpu_ids: [-1]
2018-10-27 15:59:44,767 : amsgrad: True
2018-10-27 15:59:44,767 : skip_rate: 4
2018-10-27 15:59:52,746 : Time 00h 00m 07s, episode reward 60.0, episode length 429, reward mean 60.0000
2018-10-27 16:00:17,886 : Time 00h 00m 32s, episode reward 70.0, episode length 619, reward mean 65.0000
2018-10-27 16:00:43,513 : Time 00h 00m 58s, episode reward 70.0, episode length 628, reward mean 66.6667
2018-10-27 16:01:09,034 : Time 00h 01m 24s, episode reward 70.0, episode length 633, reward mean 67.5000
2018-10-27 16:01:34,687 : Time 00h 01m 49s, episode reward 70.0, episode length 615, reward mean 68.0000
2018-10-27 16:02:00,366 : Time 00h 02m 15s, episode reward 70.0, episode length 641, reward mean 68.3333
2018-10-27 16:02:25,238 : Time 00h 02m 40s, episode reward 70.0, episode length 624, reward mean 68.5714
2018-10-27 16:02:50,496 : Time 00h 03m 05s, episode reward 70.0, episode length 622, reward mean 68.7500
2018-10-27 16:03:15,714 : Time 00h 03m 30s, episode reward 70.0, episode length 631, reward mean 68.8889
2018-10-27 16:03:40,280 : Time 00h 03m 55s, episode reward 70.0, episode length 626, reward mean 69.0000
2018-10-27 16:04:05,072 : Time 00h 04m 20s, episode reward 70.0, episode length 628, reward mean 69.0909
The test score is always 70, and it seems that the agent chooses the same path every time and gets stuck in a corner.
Could you tell me how you trained the model to get a score of 6323.01 ± 116.91 on MsPacman-v0? Are there any other parameters that I should set?
I tried to evaluate your trained model, but it seems to have no effect:
2017-08-01 21:08:13,757 : reward sum: -21.0, reward mean: -21.0000
[2017-08-01 21:08:13,757] reward sum: -21.0, reward mean: -21.0000
[2017-08-01 21:08:13,787] Starting new video recorder writing to /Volumes/xs/CodeSpace/AISpace/rl_space/rl_a3c_pytorch/Pong-v0_monitor/openaigym.video.0.33472.video000001.mp4
2017-08-01 21:08:24,947 : reward sum: -21.0, reward mean: -21.0000
[2017-08-01 21:08:24,947] reward sum: -21.0, reward mean: -21.0000
2017-08-01 21:08:35,054 : reward sum: -21.0, reward mean: -21.0000
[2017-08-01 21:08:35,054] reward sum: -21.0, reward mean: -21.0000
2017-08-01 21:08:44,732 : reward sum: -21.0, reward mean: -21.0000
[2017-08-01 21:08:44,732] reward sum: -21.0, reward mean: -21.0000
And the recordings are black-and-white videos; nothing is shown on screen.
Hi,
I think there is a problem with max_length. max_length is 20000 in your default setting, but gym's internal max episode length is 10000.
In testing, when the number of steps == 10000, player.done = True and player.max_length is False. It's possible that player.info['ale.lives'] > 0 at this time. Now the condition
if player.done and player.info['ale.lives'] > 0 and not player.max_length:
is satisfied.
In that condition, you reset the environment. Now in EpisodicLifeEnv, self.was_real_done in environment.py is True and you actually reset the gym environment (which is correct). But your code doesn't treat this 10000-step episode as a terminated episode; instead, it assumes the episode hasn't terminated because player.info['ale.lives'] > 0.
If I want to train on a new game, how should I choose the initial parameters to start tuning? Any suggestions?
Thank you for the nice implementation. I'm curious about the running time on your machine. In https://github.com/ikostrikov/pytorch-a3c, it is reported that PongDeterministic-v3 is solved in around 15 minutes; did you reproduce similar results on any version of Pong?
Thank you
Hi,
Thanks for your work. I am wondering how you trained Seaquest to achieve such a high performance. I always get stuck at ~4000. Could you please share the hyperparameters?
SpaceInvaders-v0, please.
Is there a reason why the default for eps in the Adam optimizer is so high? Currently it is 1e-3 [line 104 in shared_optim.py], whereas usually it's around 1e-08. Just wanted to see if this was done intentionally (e.g., it works better than when it is lower) or not.
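For context on why eps matters here: the Adam step is roughly lr * m_hat / (sqrt(v_hat) + eps), so a large eps mainly damps updates for parameters whose second-moment estimate is tiny. A small illustration with made-up numbers:

import torch

lr, m_hat, v_hat = 1e-4, torch.tensor(0.01), torch.tensor(1e-8)
for eps in (1e-8, 1e-3):
    step = lr * m_hat / (v_hat.sqrt() + eps)
    print(eps, step.item())   # the larger eps shrinks this step by roughly 10x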
Hi, thanks so much for the excellent codebase. Just wondering, is there any way to plot the training curve as a function of timesteps (as opposed to plotting the training curve as a function of time passed)?
Thanks!
Thanks for the implementation! Great code.
I read/ran your code and realized it processes training examples with just batch_size=1 (instead of a large batch size; am I correct on this?). I am wondering if this is by design due to your G-A3C. With a larger batch size things run faster on GPU, so why batch_size=1?
Is there anything we can do to run it with larger batches?
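For illustration only: the conv stack itself has no problem with a batch dimension, so in principle several workers' states could be evaluated in one forward pass; the layer sizes below are assumptions, not the repo's model.

import torch
import torch.nn as nn

conv = nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=2)
single = torch.randn(1, 1, 80, 80)    # batch_size=1, as in per-worker A3C
batch = torch.randn(16, 1, 80, 80)    # sixteen states batched together
print(conv(single).shape, conv(batch).shape)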
Thank you.
In NormalizedEnv, I am puzzled why you chose alpha equal to 0.9999; if I want an unbiased mean, should alpha be (num_step - 1)/num_step?
I am also puzzled why normalizing the environment observations has such a huge influence on performance. Can you explain? Thanks!
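As a point of comparison, a minimal sketch of a running-mean/std observation normalizer with the usual debiasing term 1 - alpha**t, which is one way to get an unbiased estimate without making alpha itself depend on the step count (this is an illustration, not the repo's NormalizedEnv):

import numpy as np

class RunningNormalizer:
    def __init__(self, alpha=0.9999, eps=1e-8):
        self.alpha, self.eps = alpha, eps
        self.mean, self.sq_mean, self.t = 0.0, 0.0, 0

    def __call__(self, obs):
        self.t += 1
        self.mean = self.alpha * self.mean + (1 - self.alpha) * obs.mean()
        self.sq_mean = self.alpha * self.sq_mean + (1 - self.alpha) * (obs ** 2).mean()
        debias = 1.0 - self.alpha ** self.t          # corrects the zero-init bias
        mean = self.mean / debias
        var = max(self.sq_mean / debias - mean ** 2, 0.0)
        return (obs - mean) / (np.sqrt(var) + self.eps)

# normalize = RunningNormalizer(); frame = normalize(np.random.rand(80, 80))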
Hi, I have a question about the following block:
Line 60 in eb5c9b9
if player.done and not player.info:
    state = player.env.reset()
    player.eps_len += 2
    player.state = torch.from_numpy(state).float()
    if gpu_id >= 0:
        with torch.cuda.device(gpu_id):
            player.state = player.state.cuda()
elif player.info:
I don't quite understand when info equals True or False: what is the meaning of having info=True versus info=False? I can't seem to find documentation about this info flag on the Gym website :(
Thanks
The links to the Gym environment evaluations are 404.
It seems there is no lock when updating the network params in SharedAdam.
However, isn't there a process-safety problem without a lock?
Are lines 41 and 42 in this file there to detach the previous states?
https://github.com/dgriff777/rl_a3c_pytorch/blob/master/train.py