nvlabs / ga3c
Hybrid CPU/GPU implementation of the A3C algorithm for deep reinforcement learning.
License: BSD 3-Clause "New" or "Revised" License
Hello!
I'm currently trying to train on a problem that requires anywhere from 500 to 10,000 steps per episode. Training is excruciatingly slow with the default config values. I've been experimenting with some of the parameters but haven't made any headway. Any recommendations for ways to improve the training speed?
Edit: the main thing I tried was setting tmax to a very large number so that each episode is batched into a single update. This helped, but not as much as I hoped.
This is more a question than an issue. I was looking at the code, and I realized that you use functions like train, predict_p_and_v, predict_p and some others from the file NetworkVP.py. All these functions use the method sess.run. My question is: how do you know that this code runs without a problem? I ask because, as far as I can see, there's nothing ensuring that sess.run is not called while another sess.run is in progress. I thought that I needed to use a coordinator to do something like that. If you have a reference that would help me understand, that'd be great. Thanks!
In file Server.py, line 131:
while self.trainers:
    self.add_trainer()
Should it be self.remove_trainer()?
I've just finished reading the GA3C paper, and I didn't notice any mention of the frame processing to remove flickering artifacts, as explained in DeepMind's original Deep-Q paper methods section, and referenced in the experimental setup for the A3C paper.
I've just had a look through the source code, and it seems as though the Environment._preprocess(image) function is applied on a frame-by-frame basis, with no joint thresholding across the previous frame. I was wondering if this was a design decision? If not, is there any plan to incorporate this preprocessing step into the model, to more closely resemble DeepMind's A3C approach?
Alternatively, if I've misinterpreted the code (which is entirely possible), I would be grateful if you could point out where/how this preprocessing step is incorporated.
I will try implementing it efficiently myself in the meantime, which you are welcome to use if helpful.
P.s. this code is proving to be extremely useful!
Thanks a lot for sharing!
Dan
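For reference, the de-flickering step in the DQN methods section takes the per-pixel maximum of the current and the previous raw frame. A minimal NumPy sketch of that idea follows. `max_over_last_two_frames` is a hypothetical helper, not a function from this repository, and the tiny frames are made up for illustration.

```python
import numpy as np


def max_over_last_two_frames(previous_frame, current_frame):
    """Return the per-pixel maximum of two consecutive frames.

    Hypothetical helper: sprites that are drawn only on alternating
    frames stay visible in the combined frame.
    """
    if previous_frame is None:
        # First frame of an episode: nothing to combine with.
        return current_frame
    return np.maximum(previous_frame, current_frame)


# Two tiny 2x2 "frames": each shows a sprite that the other one misses.
frame_a = np.array([[0, 255], [0, 0]], dtype=np.uint8)
frame_b = np.array([[0, 0], [128, 0]], dtype=np.uint8)
combined = max_over_last_two_frames(frame_a, frame_b)
```

In a preprocessing pipeline this would presumably run on the raw frames before resizing, with the previous frame cached per environment.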
Is there any plan on the horizon to port this code to PyTorch?
Hi,
What is the meaning of the negative RScore in the output I get? Should I apply abs(RScore) to get the actual reward?
And how do I know when training is close to converging?
[Time: 943] [Episode: 1046 Score: -20.0000] [RScore: -20.3550 RPPS: 1255] [PPS: 1228 TPS: 208] [NT: 2 NP: 2 NA: 33]
[Time: 943] [Episode: 1047 Score: -20.0000] [RScore: -20.3550 RPPS: 1256] [PPS: 1229 TPS: 208] [NT: 2 NP: 2 NA: 33]
[Time: 943] [Episode: 1048 Score: -19.0000] [RScore: -20.3530 RPPS: 1259] [PPS: 1230 TPS: 208] [NT: 2 NP: 2 NA: 33]
[Time: 944] [Episode: 1049 Score: -21.0000] [RScore: -20.3530 RPPS: 1259] [PPS: 1231 TPS: 208] [NT: 2 NP: 2 NA: 33]
[Time: 944] [Episode: 1050 Score: -19.0000] [RScore: -20.3520 RPPS: 1258] [PPS: 1231 TPS: 208] [NT: 2 NP: 2 NA: 33]
[Time: 944] [Episode: 1051 Score: -20.0000] [RScore: -20.3530 RPPS: 1259] [PPS: 1232 TPS: 208] [NT: 2 NP: 2 NA: 33]
[Time: 945] [Episode: 1052 Score: -20.0000] [RScore: -20.3530 RPPS: 1259] [PPS: 1233 TPS: 208] [NT: 2 NP: 2 NA: 33]
[Time: 947] [Episode: 1053 Score: -20.0000] [RScore: -20.3530 RPPS: 1257] [PPS: 1231 TPS: 208] [NT: 2 NP: 2 NA: 33]
[Time: 947] [Episode: 1054 Score: -20.0000] [RScore: -20.3530 RPPS: 1257] [PPS: 1232 TPS: 208] [NT: 2 NP: 2 NA: 33]
Thanks!
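One way to judge progress from output like the above is to parse the log and watch how RScore moves over time. The sketch below assumes the line format shown in this issue. The regular expression and the helper `parse_log_line` are illustrative and not part of the repository. The sign is kept as is: in a game such as Pong a negative score means the agent lost by that margin, so applying abs() would hide whether the agent is improving.

```python
import re

# Matches the first three bracketed fields of a log line as shown above.
LOG_PATTERN = re.compile(
    r"\[Time:\s*(?P<time>\d+)\]\s*"
    r"\[Episode:\s*(?P<episode>\d+)\s+Score:\s*(?P<score>-?\d+\.\d+)\]\s*"
    r"\[RScore:\s*(?P<rscore>-?\d+\.\d+)"
)


def parse_log_line(line):
    """Extract time, episode, Score and RScore from one log line (or None)."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        "time": int(match.group("time")),
        "episode": int(match.group("episode")),
        "score": float(match.group("score")),
        "rscore": float(match.group("rscore")),
    }


lines = [
    "[Time: 943] [Episode: 1046 Score: -20.0000] [RScore: -20.3550 RPPS: 1255] [PPS: 1228 TPS: 208] [NT: 2 NP: 2 NA: 33]",
    "[Time: 947] [Episode: 1054 Score: -20.0000] [RScore: -20.3530 RPPS: 1257] [PPS: 1232 TPS: 208] [NT: 2 NP: 2 NA: 33]",
]
records = [parse_log_line(line) for line in lines]

# A positive change means the smoothed score is moving upward.
rscore_change = records[-1]["rscore"] - records[0]["rscore"]
```

Over the excerpt above the change is tiny, which suggests the agent has not started improving yet. A sustained upward trend over many episodes would be the signal to look for.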
For some Atari games, including Breakout, the environment sometimes waits for user input to continue (when the user loses a life, for example).
Gameplay may be stuck forever if the 'no-op' action is selected in such situations. To prevent this, ProcessAgent may need to track runs of repeated 'no-op' actions and take a 'real' action if the run length goes beyond a limit.
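The idea above can be sketched as a small counter that forces a different action after too many consecutive no-ops. `NoOpLimiter` is a hypothetical class, not part of ProcessAgent, and the action indices (0 for no-op, 1 for the fallback) and the limit are assumptions that would need to match the actual game.

```python
class NoOpLimiter:
    """Replace a no-op with a fallback action after too many in a row."""

    def __init__(self, noop_action=0, fallback_action=1, max_noops=30):
        # All three defaults are assumptions; adjust them per game.
        self.noop_action = noop_action
        self.fallback_action = fallback_action
        self.max_noops = max_noops
        self.consecutive_noops = 0

    def reset(self):
        """Call at the start of every episode."""
        self.consecutive_noops = 0

    def filter(self, action):
        """Return the action to send to the environment."""
        if action != self.noop_action:
            self.consecutive_noops = 0
            return action
        self.consecutive_noops += 1
        if self.consecutive_noops > self.max_noops:
            # Too many no-ops in a row: force a real action once.
            self.consecutive_noops = 0
            return self.fallback_action
        return action


limiter = NoOpLimiter(noop_action=0, fallback_action=1, max_noops=3)
filtered = [limiter.filter(0) for _ in range(5)]
```

With a limit of 3, the fourth consecutive no-op is replaced by the fallback action and the counter starts over.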
Hi,
I want to reproduce the comparison between A3C and GA3C in Table 2 in your paper.
I wonder if the A3C experiment can be done using this repo?
Thanks
Hi,
From running the experiment, I found that the displayed values of RPPS, PPS and TPS keep increasing for many episodes. How can I measure the speedup of one configuration over another within a small number of episodes?
Thanks!
Setting up openai/universe, I used the "universe starter agent" as a smoke test.
After adjusting the number of workers to better utilize my CPU, I saw the default PongDeterministic-v3 start winning after about 45 minutes.
Then I wanted to try GA3C on the same machine; given that you quote speedups of 6x or better, I expected it to perform at least as well as the starter agent.
However, it turns out that with GA3C the agent only starts winning after roughly 90 minutes.
I'm assuming that either my first (few) run(s) on the starter agent were just lucky, or that my runs on GA3C were unlucky. Also I assume that the starter agent has other changes from the A3C that you compared GA3C against, at least in parameters, possibly in algorithm.
So, what can I (an experienced software engineer but with no background in ML), do to make the two methods more comparable on my machine? Is it just a matter of tweaking a few parameters? Is Pong not a good choice to make the comparison?
I have an i7-3930k, a GTX 1060 (6 GB) and 32 GB of RAM.
./_train.sh: line 3: 3010 Segmentation fault (core dumped) python GA3C.py "$@"
Does anybody get a segmentation problem like this?
The issue is documented here, but I was wondering if you have ever had problems with messages like this appearing during training:
tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 1648707130 get requests, put_count=1648707127 evicted_count=2741000 eviction_rate=0.00166251 and unsatisfied allocation rate=0.00166258
I get this message quite often when cloning this repo and running it, untouched, on Pong. The issue seems worse when training on a custom pygame game I made, and one time the training ground to a halt completely, with no more output to results.txt and the console full of these messages.
If you have never had problems like this with your network, then I will close the issue. Otherwise, any advice would be greatly appreciated.
After training a model successfully using ./_train.sh, I am attempting to run and render my game using this model on OS X with ./_play.sh. When I run this, I receive the following errors:
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
After researching online, this seems to be an issue stemming from multiple processes attempting to render in parallel on OS X. However, the numbers of trainers, predictors, and agents are all 1. Additionally, I believe the error stems from lines 37 and 38 of the file ThreadDynamicAdjustment.py.
I am on macOS High Sierra version 10.13.3.
Python 3.6.4 :: Anaconda, Inc.
# packages in environment at /Users/<User>/miniconda3/envs/missionplanner:
#
absl-py 0.1.11 <pip>
astor 0.6.2 <pip>
bleach 1.5.0 <pip>
ca-certificates 2017.08.26 ha1e5d58_0
certifi 2018.1.18 py36_0
chardet 3.0.4 <pip>
cycler 0.10.0 <pip>
decorator 4.2.1 <pip>
future 0.16.0 <pip>
gast 0.2.0 <pip>
grpcio 1.10.0 <pip>
gym 0.9.7 <pip>
gym-cap 0.2 <pip>
html5lib 0.9999999 <pip>
idna 2.6 <pip>
kiwisolver 1.0.1 <pip>
libcxx 4.0.1 h579ed51_0
libcxxabi 4.0.1 hebd6815_0
libedit 3.1 hb4e282d_0
libffi 3.2.1 h475c297_4
Markdown 2.6.11 <pip>
matplotlib 2.2.0 <pip>
ncurses 6.0 hd04f020_2
networkx 2.1 <pip>
numpy 1.14.0 <pip>
openssl 1.0.2n hdbc3d79_0
pandas 0.22.0 <pip>
Pillow 5.0.0 <pip>
pip 9.0.1 py36h1555ced_4
protobuf 3.5.2 <pip>
pygame 1.9.3 <pip>
pyglet 1.4.0a1 <pip>
pyparsing 2.2.0 <pip>
python 3.6.4 hc167b69_1
python-dateutil 2.6.1 <pip>
pytz 2018.3 <pip>
PyWavelets 0.5.2 <pip>
readline 7.0 hc1231fa_4
requests 2.18.4 <pip>
scikit-image 0.13.1 <pip>
scipy 1.0.0 <pip>
setuptools 38.4.0 py36_0
six 1.11.0 <pip>
sqlite 3.22.0 h3efe00b_0
tensorboard 1.6.0 <pip>
tensorflow 1.6.0 <pip>
termcolor 1.1.0 <pip>
tk 8.6.7 h35a86e2_3
urllib3 1.22 <pip>
Werkzeug 0.14.1 <pip>
wheel 0.30.0 py36h5eb2c71_1
xz 5.2.3 h0278029_2
zlib 1.2.11 hf3cbc9b_2
In the NetworkVP.py file on line 93, the activation function should be explicitly set to None. As it currently stands, the logits are being put through a ReLU non-linearity before softmax is applied.
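To illustrate why this matters, the NumPy sketch below compares a softmax over raw logits with a softmax over ReLU-clamped logits. The logit values are made up. The point is only that every negative pre-activation collapses to zero, so the corresponding actions all end up with the same probability and none of them can be pushed close to zero.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / np.sum(exp)


# Made-up pre-activations for three actions.
raw_logits = np.array([2.0, -1.0, -5.0])

probs_linear = softmax(raw_logits)                 # activation=None
probs_relu = softmax(np.maximum(raw_logits, 0.0))  # ReLU before softmax
```

Here `probs_relu` assigns the second and third action identical probabilities, while `probs_linear` keeps them distinct and lets the third one become very small.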
Hello!
In file ThreadTrainer.py:
x__ = np.concatenate((x__, x_))
As x_ may be shorter than TMAX, should we pad it before concatenating?
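As a side note on the NumPy behaviour involved: np.concatenate joins arrays along axis 0 by default and only requires the remaining dimensions to match, so batches with different numbers of steps can be joined without padding. The shapes below are illustrative assumptions, not values taken from the repository.

```python
import numpy as np

# Hypothetical batches of stacked frames: 5 steps and 3 steps,
# each step shaped (84, 84, 4). The shapes are illustrative only.
x_long = np.zeros((5, 84, 84, 4), dtype=np.float32)
x_short = np.ones((3, 84, 84, 4), dtype=np.float32)

# axis=0 is the default, so only the trailing dimensions must agree.
merged = np.concatenate((x_long, x_short))
```

Padding would only be needed if the steps had to be kept grouped per agent, for example for a recurrent model.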
This is great work. Is there any plan to develop an LSTM version?
After running the _train.sh with the default Config.py on a DGX-1 for about an hour I see that the CPU usage stays pretty constant at about 15%, and one GPU is being used at about 40%.
The settings in Config.py are unchanged: DYNAMIC_SETTINGS = True. The number of trainers varies between 2 and 6, the number of predictors varies between 1 and 2, and the number of agents varies from 34 to 39. I would have expected them to grow to use the available CPU resources.
for d in ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']:
    with tf.device(d):
        .... calcs here...
The code runs fine but leaks CPU and memory and will crash your system. I am using the Glances monitoring tool (pip install glances). If you leave the code running for a long time, the number of CPU context switches increases substantially, and CPU and memory usage keep growing until the code hangs or crashes. CPU usage increased from 6.7% to 64% and memory from 10% to 79%, at which point the system froze. When I look at the Nvidia TITAN X (Maxwell, 12 GB memory), it is only using about 300 MB out of 12 GB. So although most of the heavy lifting should be offloaded to the GPU, that does not seem to be the case here. I have 8 x TITAN Maxwell GPUs with 2 x Intel Xeon 2660 v3 (2 CPUs with a total of 40 CPU cores) and 128 GB of DDR4 memory, and I can use any of them. Still I get the same results: CPU usage keeps increasing.
Any insights?
Other implementations, the original A3C or various hybrid (CPU & GPU) versions, seem to offload most of the heavy lifting to the GPU and cause no system freezes, unlike GA3C.
I have been testing it on various amounts of data and games.
Hello!
Why does ProcessAgent use Process while ThreadTrainer uses Thread? I wonder whether ProcessAgent.py could use Thread instead of Process.
Looking at the NetworkVP.py code, it has 2 convolutional layers and a dense layer, and the comment above them says # As implemented in A3C paper.
But the paper Asynchronous Methods for Deep Reinforcement Learning does not describe the structure of the network layers. So I assumed that A3C has the same model structure as DQN, since the paper compares A3C with DQN.
DQN has 3 convolutional layers and a dense layer, but the GA3C code does not.
I've just finished reading the GA3C paper, and I didn't notice any mention of action repeat (as implemented in original Deep-Q paper, and A3C paper). From my understanding, action repeat means that, not only are frames stacked together (into an effective single state) in groups of 4, but actions are also only selected every 4 frames, with these action selections then being carried through (i.e. repeated) until another 4 frames have passed, ready for another action selection.
From my understanding, these correspond to two different hyperparameters: one for how many frames to stack (referred to as "m" in the Deep-Q methods section), and one for how often to select actions (referred to as "k" in the Deep-Q methods section). Both are also explicitly referenced in the A3C Experimental Setup section.
I may be wrong, but I've gone through this source code, and from my understanding, a different action is selected every single frame, not every 4th frame (or k-th frame, to be more general). It seems as though there is no option to change this, to more closely replicate DeepMind's A3C approach. I was wondering if this was a design decision? If not, is there any plan to incorporate an action repeat option into the model, with a parameter in the Config.py file etc.?
Alternatively, if I've misinterpreted the code (which is entirely possible), I would be grateful if you could point out where/how action repeat is incorporated.
I will try implementing it efficiently myself in the meantime, which you are welcome to use if helpful.
P.s. this code is proving to be extremely useful!
Thanks a lot for sharing!
Dan
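Action repeat as described above can be sketched as a thin wrapper around the environment's step call. Everything here is hypothetical: `CountingEnv` is a toy stand-in environment and `step_with_action_repeat` is not a function from this repository. The sketch only shows an action being held for k frames while the rewards are summed.

```python
class CountingEnv:
    """Toy stand-in environment: reward 1 per step, done after `horizon` steps."""

    def __init__(self, horizon=10):
        self.horizon = horizon
        self.steps = 0

    def step(self, action):
        self.steps += 1
        done = self.steps >= self.horizon
        # The observation is simply the step counter.
        return self.steps, 1.0, done


def step_with_action_repeat(env, action, repeat=4):
    """Apply `action` for up to `repeat` frames and sum the rewards."""
    total_reward = 0.0
    observation, done = None, False
    for _ in range(repeat):
        observation, reward, done = env.step(action)
        total_reward += reward
        if done:
            # Stop early so the next episode starts cleanly.
            break
    return observation, total_reward, done


env = CountingEnv(horizon=10)
first = step_with_action_repeat(env, action=0, repeat=4)
second = step_with_action_repeat(env, action=0, repeat=4)
third = step_with_action_repeat(env, action=0, repeat=4)
```

The third call ends after two frames because the toy episode terminates, which is why the loop checks `done` on every frame.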
Hi,
I am running testing (playing) for 500 iterations, and then I want to automatically start another job. But I cannot do that, because 2 of the 3 GA3C jobs do not end and need to be killed by .
The reason seems to be that the predictor gets stuck waiting for an agent that has already terminated (see the end of Server.py / ThreadPredictor.py). Could you please look at that?
Thanks,
Ernst
Known issue. Adding it to keep track of it.
In ThreadTrainer.py, I don't understand how the following lines are supposed to affect the batch size :
np.concatenate((x__, x_))
np.concatenate((r__, r_))
np.concatenate((a__, a_))
np.concatenate returns the merged array, but does not modify x__ or x_.
However, I do measure a drop in TPS. What sorcery is this?
[Time: 404] [Episode: 213 Score: -1.0642] [RScore: 7.5345 RPPS: 281] [PPS: 282 TPS: 4] [NT: 2 NP: 3 NA: 4]
(The PPS/TPS is low overall in my case because the game is a costly one running on a remote desktop.)
EDIT: I suggest changing this to:
x__ = np.concatenate((x__, x_))
r__ = np.concatenate((r__, r_))
a__ = np.concatenate((a__, a_))
but this does not affect TPS compared to other
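The NumPy behaviour in question can be checked in isolation. The sketch below uses made-up arrays rather than the repository's batches: without an assignment the result of np.concatenate is discarded and the original array keeps its size.

```python
import numpy as np

x_batch = np.array([1, 2, 3])
x_new = np.array([4, 5])

# The merged array is returned but not stored, so x_batch is unchanged.
np.concatenate((x_batch, x_new))
size_without_assignment = x_batch.shape[0]

# Rebinding the name keeps the merged array.
x_batch = np.concatenate((x_batch, x_new))
size_with_assignment = x_batch.shape[0]
```

So the unassigned calls still pay for the copy but leave the batch at its original size.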
Because of the use of the deprecated gym.undo_logger_setup() method (openai/gym@4c460ba), GA3C no longer works with the most recent versions of Gym. Calling this method is no longer required, as OpenAI Gym no longer modifies the global logging configuration.
I've submitted a pull request (#42) to remove the use of the deprecated method.
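For code that has to run against both older and newer Gym releases, one alternative to deleting the call is to guard it. This is a sketch of that alternative, not what the pull request does. `undo_logger_setup_if_available` is a hypothetical helper, and the two `SimpleNamespace` objects are stand-ins for an old and a new `gym` module so that the example runs without Gym installed.

```python
import types


def undo_logger_setup_if_available(gym_module):
    """Call gym_module.undo_logger_setup() only when it exists.

    Returns True if the call was made, False otherwise.
    """
    setup = getattr(gym_module, "undo_logger_setup", None)
    if callable(setup):
        setup()
        return True
    return False


# Stand-ins for an old Gym (has the method) and a new Gym (does not).
calls = []
old_gym = types.SimpleNamespace(undo_logger_setup=lambda: calls.append("called"))
new_gym = types.SimpleNamespace()

called_on_old = undo_logger_setup_if_available(old_gym)
called_on_new = undo_logger_setup_if_available(new_gym)
```

In real code the argument would be the imported `gym` module itself.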
Thank you for the easy to use and fast A3C implementation. I created a simple problem for rapid testing that rewards 0 on all steps except the terminal step, where it rewards either -1 or 1. GA3C cannot learn this problem because of line 107 in ProcessAgent.py:
terminal_reward = 0 if done else value
which causes the agent to ignore the only meaningful reward in this environment, and line 63 in ProcessAgent.py:
return experiences[:-1]
which causes the agent to ignore the only meaningful experience in this environment.
This is easily fixed by changing line 107 in ProcessAgent.py to
terminal_reward = reward if done else value
and _accumulate_rewards() in ProcessAgent.py to return all experiences if the agent has taken a terminal step. These changes should generally increase performance as terminal steps often contain valuable reward signal.
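The effect of including the terminal step can be seen in a small discounted-return calculation. This is a standalone sketch under the assumptions stated in the comments, not the repository's `_accumulate_rewards()`: `discounted_returns` is a hypothetical helper that keeps every step and bootstraps from a value that is 0 when the episode has ended.

```python
def discounted_returns(rewards, discount_factor, bootstrap_value):
    """Compute the discounted return for every step, last step included.

    bootstrap_value is the estimated value of the state after the last
    step: 0 if the episode ended there, a critic estimate otherwise.
    """
    returns = []
    running = bootstrap_value
    for reward in reversed(rewards):
        running = reward + discount_factor * running
        returns.append(running)
    returns.reverse()
    return returns


# Sparse-reward episode: only the terminal step is rewarded.
episode_returns = discounted_returns([0.0, 0.0, 1.0], 0.5, bootstrap_value=0.0)
```

With the terminal reward kept, every earlier step receives a non-zero return. If the last step and its reward were dropped, all returns in this episode would be zero.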
Hi!
Thanks for a great article! I'm interested in it and want to try some ideas.
When are you planning to release the framework? Can you provide an estimated release date?
I am running a training session right now. I replaced PongDeterministic-v0 with Pong-v0 (the former does not seem to exist in my install); other than that, everything is the same in Config.py and the other files.
After 2 hours :
[Time: 7795] [Episode: 7923 Score: -20.0000] [RScore: -20.2960 RPPS: 1488] [PPS: 1499 TPS: 251] [NT: 5 NP: 4 NA: 28]
Am I missing something here? Is there a need to modify Config.py?
EDIT : needed to update gym; retried with PongDeterministic-v0:
Here it is with Learning Rate = 1e-3 and the game PongDeterministic-v0
[Time: 4177] [Episode: 4397 Score: -9.0000] [RScore: -10.5960 RPPS: 1645] [PPS: 1646 TPS: 278] [NT: 4 NP: 4 NA: 33]
Any idea what the difference between the two games is?
When I run the _play.sh script, I can't see the agent in action.
In NetworkVP and ThreadTrainer, we see that train_model is called with x, r, a (states, rewards, actions).
Wouldn't it be simpler to just maintain a history of p and v for each agent inside NetworkVP, and compute the loss and backprop in the train function, using agent indices, when the rewards come back?
That way we would avoid recomputing the forward pass (already done in predict), possibly even accelerating the whole process.
Hello,
I'm writing you to discuss a problem that is not directly related with your code and application but is affecting my own efforts to apply RL on a more custom type of problem where we have an environment which is not atari-like.
In my custom environment, my observations are noisy, difficult-to-interpret images, and I can take a number of actions. For each image, between 1 and 4 actions can be considered correct and between 5 and 8 actions are always incorrect.
This problem can be also formulated in a fully supervised manner, as a classification problem, where we ignore the fact that more than one action can be considered correct at a time and that these actions are related to each other over time and define a "trajectory" of actions.
When we use the supervised approach in the way I just described, the system works well, meaning that it does not struggle to interpret those noisy images that are difficult for humans to understand.
When these images are organized in a structured manner, in an environment one can play with, it's possible to use an RL algorithm to solve the problem. We have tried DQN with satisfactory results. In that case the reward signal is provided continuously: for an action that goes in the right direction we assign a reward of 0.05, for an action that goes in the wrong one -0.15, for a "done" action issued correctly +1, and for a "done" action issued incorrectly -0.25. A "done" action doesn't terminate the episode (it is terminated after 50 steps). In this setting DQN converges very slowly and shows nice validation results.
When we employ A3C, the behavior is either:
I am very puzzled by this behavior. I have checked and re-checked every moving piece: the environment, the image inputs to the network, the rewards, and the distribution of different actions over time (to see whether the network is just learning to always issue the same action for some reason). None of these seems to be the problem. I thought it was a problem of exploration vs. exploitation, and therefore I first reduced and then increased beta. I have also decayed beta over time to see what happens, but the most I obtained was a sinusoidal kind of behavior of the reward.
I have also tried an epsilon-greedy strategy (just for debugging purposes) instead of sampling from the policy distribution, with no success (the network converges to the worst possible scenario rather quickly).
I tried reducing learning rate with no success.
Now, the policy gradient loss is not exactly the same as the cross-entropy loss, but it resembles it quite a bit. With an epsilon-greedy policy I would expect that, for each image (we have a limited number of images (observations) that are presented again when the environment reaches a similar state), all the possible actions are actually explored, and therefore the policy is learned in a way that is not so far from the supervised case. If I set the discount factor to zero (which I have tried), the value part of the network does not really play a role (I might be mistaken though), and if I give a reward for each step I take, I should converge to something that resembles the classification approach I described above.
Maybe the fact that multiple actions can generate the same reward or penalty is the problem ?!
I would immensely appreciate any help or thoughts from you. Although I'm really motivated to apply RL to my specific problem, I really don't know what to do to improve the situation.
Thanks a lot,
Fausto
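The resemblance between the two losses mentioned above can be made concrete for a single sample. The sketch below is a simplification that ignores the value and entropy terms of A3C, and the logits and advantages are made up. It only shows that the policy gradient term equals the cross-entropy loss when the advantage is 1 and the taken action is the label, and that a small advantage scales the same term down.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))


def cross_entropy_loss(logits, label):
    """Supervised loss for one sample with an integer class label."""
    return -log_softmax(logits)[label]


def policy_gradient_loss(logits, action, advantage):
    """Policy term only: -advantage * log pi(action | state)."""
    return -advantage * log_softmax(logits)[action]


logits = np.array([0.5, -1.0, 2.0])  # made-up network outputs
ce = cross_entropy_loss(logits, label=2)
pg_same = policy_gradient_loss(logits, action=2, advantage=1.0)
pg_scaled = policy_gradient_loss(logits, action=2, advantage=0.05)
```

One difference from the supervised case is that the sign and size of the advantage depend on the critic's estimate, so a correct action can still receive a negative weight while the value function is inaccurate.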
I believe there is a bug in the A3C algorithm implementation, in the file "ProcessAgent.py" on line 107. The sub-episode return should be bootstrapped from the value of the next state, not the previous one.
I suggest replacing:
prediction, value = self.predict(self.env.current_state)
...
if done or time_count == Config.TIME_MAX:
    terminal_reward = 0 if done else value
with:
prediction, value = self.predict(self.env.current_state)
...
if done or time_count == Config.TIME_MAX:
    terminal_reward = 0
    if not done:
        (_, terminal_reward) = self.predict(self.env.current_state)
https://github.com/ikostrikov/pytorch-a3c has an implementation (CPU only) that can converge on PongDeterministic-v3 within 15 minutes, while the GPU-powered GA3C appears to take 2-3 hours to achieve the same.
Based on my (limited) comparison, they are using Adam instead of RMSProp and PongDeterministic-v3 instead of PongDeterministic-v0.
Maybe there is so much overhead in pushing data to the GPU that only large models would see a true speedup?
In Server.py, line 108, setting trainer.enabled = False may be useless? Here 'trainer' is an instance of class 'ThreadTrainer'.
Is it not used anywhere?
The 'enabled' property only appears in class ThreadDynamicAdjustment, not in class 'ThreadTrainer'.
I have tested the code on a GTX 1080 with 32 GB of RAM. When I run the code, memory usage increases over time, and after about 30 hours it takes up all 32 GB of RAM and causes the system to die.