nvlabs / ga3c


Hybrid CPU/GPU implementation of the A3C algorithm for deep reinforcement learning.

License: BSD 3-Clause "New" or "Revised" License

Python 99.61% Shell 0.39%


ga3c's Issues

Training on environments with long episode length

Hello!

I'm currently trying to train on a problem that requires anywhere from 500 to 10,000 steps per episode. Training is excruciatingly slow with the default config values. I've been experimenting with some of the parameters but haven't made any headway. Any recommendations for ways to improve the training speed?

Edit: the main thing I tried was setting tmax to a very large number so that each episode is batched into a single update. This helped, but not as much as I had hoped.

How do you make sure that sess.run is not run at the same time or while in use?

This is more a question than an issue. Looking at the code, I realized that you use functions like train, predict_p_and_v, predict_p and some others from the file NetworkVP.py. All these functions call sess.run. My question is: how do you know that this code runs without a problem? I ask because, as far as I can see, nothing prevents sess.run from being called while another sess.run is in progress. I thought I needed to use a coordinator for something like that. If you have a reference that would help me understand, that'd be great. Thanks!

server.py is adding 'trainer' instead of removing it at the end of training.

In the file server.py, line 131:
    while self.trainers:
        self.add_trainer()

Should it be self.remove_trainer() ?
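
The concern can be illustrated with a minimal sketch. `TinyServer` below is a hypothetical stand-in, not the repository's server class; it only shows that a `while self.trainers:` loop can terminate only if its body shrinks the list.

```python
class TinyServer:
    """Hypothetical stand-in for the server; not the repository's class."""

    def __init__(self, n_trainers):
        self.trainers = ["trainer-%d" % i for i in range(n_trainers)]

    def add_trainer(self):
        self.trainers.append("trainer-%d" % len(self.trainers))

    def remove_trainer(self):
        self.trainers.pop()

    def shutdown(self):
        # The loop condition becomes false only because the body removes items.
        # Calling add_trainer() here instead would never terminate.
        while self.trainers:
            self.remove_trainer()


server = TinyServer(n_trainers=3)
server.shutdown()
remaining = len(server.trainers)
```

With `remove_trainer()` in the loop body the list empties after three iterations; with `add_trainer()` the condition would stay true indefinitely.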

Frame Preprocessing Step (for flickering)

I've just finished reading the GA3C paper, and I didn't notice any mention of the frame processing to remove flickering artifacts, as explained in DeepMind's original Deep-Q paper methods section, and referenced in the experimental setup for the A3C paper.

I've just had a look through the source code, and it seems as though the Environment._preprocess(image) function is applied on a frame-by-frame basis, with no joint thresholding across the previous frame. I was wondering if this was a design decision? If not, is there any plan to incorporate this preprocessing step into the model, to more closely resemble DeepMind's A3C approach?

Alternatively, if I've misinterpreted the code (which is entirely possible), I would be grateful if you could point out where/how this preprocessing step is incorporated.

I will try implementing it efficiently myself in the meantime, which you are welcome to use if helpful.

P.s. this code is proving to be extremely useful!
Thanks a lot for sharing!

Dan
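
For reference, the de-flickering step in the DQN paper takes the per-pixel maximum over the current and the previous frame. A minimal sketch of that step follows; the helper name `max_over_last_two` and the toy frames are illustrative assumptions and not part of the GA3C code.

```python
import numpy as np


def max_over_last_two(prev_frame, frame):
    """Per-pixel maximum of two consecutive frames of the same shape."""
    if prev_frame is None:
        # First frame of an episode: nothing to merge with.
        return frame
    return np.maximum(prev_frame, frame)


# Two toy 2x2 grayscale frames: each sprite is visible in only one of them.
frame_a = np.array([[0, 255], [0, 0]], dtype=np.uint8)
frame_b = np.array([[0, 0], [128, 0]], dtype=np.uint8)
merged = max_over_last_two(frame_a, frame_b)
```

The merged frame contains both sprites, which is the purpose of the step: objects drawn only on alternating frames stay visible in every observation.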

pyTorch

Is there any plan on the horizon to port this code to PyTorch?

Meaning of the RScore

Hi,

What is the meaning of the negative RScore in the output I get? Should I apply abs(RScore) to get the actual reward?
And how do I know when training is close to converging?

[Time:      943] [Episode:     1046 Score:   -20.0000] [RScore:   -20.3550 RPPS:  1255] [PPS:  1228 TPS:   208] [NT:  2 NP:  2 NA: 33]
[Time:      943] [Episode:     1047 Score:   -20.0000] [RScore:   -20.3550 RPPS:  1256] [PPS:  1229 TPS:   208] [NT:  2 NP:  2 NA: 33]
[Time:      943] [Episode:     1048 Score:   -19.0000] [RScore:   -20.3530 RPPS:  1259] [PPS:  1230 TPS:   208] [NT:  2 NP:  2 NA: 33]
[Time:      944] [Episode:     1049 Score:   -21.0000] [RScore:   -20.3530 RPPS:  1259] [PPS:  1231 TPS:   208] [NT:  2 NP:  2 NA: 33]
[Time:      944] [Episode:     1050 Score:   -19.0000] [RScore:   -20.3520 RPPS:  1258] [PPS:  1231 TPS:   208] [NT:  2 NP:  2 NA: 33]
[Time:      944] [Episode:     1051 Score:   -20.0000] [RScore:   -20.3530 RPPS:  1259] [PPS:  1232 TPS:   208] [NT:  2 NP:  2 NA: 33]
[Time:      945] [Episode:     1052 Score:   -20.0000] [RScore:   -20.3530 RPPS:  1259] [PPS:  1233 TPS:   208] [NT:  2 NP:  2 NA: 33]
[Time:      947] [Episode:     1053 Score:   -20.0000] [RScore:   -20.3530 RPPS:  1257] [PPS:  1231 TPS:   208] [NT:  2 NP:  2 NA: 33]
[Time:      947] [Episode:     1054 Score:   -20.0000] [RScore:   -20.3530 RPPS:  1257] [PPS:  1232 TPS:   208] [NT:  2 NP:  2 NA: 33]

Thanks!
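
One way to read the log, assuming RScore is a rolling mean of recent episode scores (an assumption here; the window size below is made up), is sketched next. In Pong the episode score ranges from -21 to +21, so a negative value means the agent is losing, and taking the absolute value would hide that.

```python
from collections import deque


def rolling_means(scores, window):
    """Mean of the last `window` scores after each episode."""
    recent = deque(maxlen=window)
    means = []
    for score in scores:
        recent.append(score)
        means.append(sum(recent) / len(recent))
    return means


# Episode scores like those in the log above (window of 3 is hypothetical).
scores = [-20.0, -20.0, -19.0, -21.0, -19.0]
means = rolling_means(scores, window=3)
```

Under this reading, progress shows up as the rolling mean moving upward toward positive values over many episodes, rather than as a change in its magnitude.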

Need an action trigger for 'press to continue' kind of situations

For some Atari games, including Breakout, the environment sometimes waits for user input to continue (when the user loses a life, for example).
Game play may be stuck forever if the 'no-op' action is chosen in such situations. To prevent this, ProcessAgent may need to track the sequence of repeated 'no-op' actions and take a 'real' action if its length goes beyond a limit.
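
A minimal sketch of such a trigger is shown below. `NoOpGuard`, the action ids, and the limit are hypothetical and not taken from ProcessAgent; the sketch only counts consecutive no-ops and substitutes a fallback action once the limit is exceeded.

```python
class NoOpGuard:
    """Replaces the action after too many consecutive no-ops (hypothetical helper)."""

    def __init__(self, noop_action, fallback_action, max_noops):
        self.noop_action = noop_action
        self.fallback_action = fallback_action
        self.max_noops = max_noops
        self.noop_count = 0

    def filter(self, action):
        if action != self.noop_action:
            # Any real action resets the counter.
            self.noop_count = 0
            return action
        self.noop_count += 1
        if self.noop_count > self.max_noops:
            # Limit exceeded: force the fallback action and start counting again.
            self.noop_count = 0
            return self.fallback_action
        return action


# Assumed ids for illustration: 0 is the no-op, 1 is the action that resumes play.
guard = NoOpGuard(noop_action=0, fallback_action=1, max_noops=3)
chosen = [guard.filter(0) for _ in range(5)]
```

In this run the first three no-ops pass through, the fourth is replaced by the fallback action, and counting starts over.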

how to run CPU version of A3C

Hi,

I want to reproduce the comparison between A3C and GA3C in Table 2 in your paper.

I wonder if the A3C experiment can be done using this repo?

Thanks

Why are RPPS, PPS, and TPS consistently increasing?

Hi,

From running the experiment, I found that the displayed values of RPPS, PPS, and TPS consistently increase over many episodes. How can I tell, within a small number of episodes, whether one configuration is faster than another?

Thanks!

Trying to compare this to universe-starter-agent (A3C)

Setting up openai/universe, I used the "universe starter agent" as a smoke test.

After adjusting the number of workers to better utilize my CPU, I saw the default PongDeterministic-v3 start winning after about 45 minutes.

Then I wanted to try GA3C on the same machine; given that you quote results of 6x or better, I expected it to perform at least as well as that result.

However, it turns out that with GA3C the agent only starts winning after roughly 90 minutes.

I'm assuming that either my first (few) run(s) on the starter agent were just lucky, or that my runs on GA3C were unlucky. Also I assume that the starter agent has other changes from the A3C that you compared GA3C against, at least in parameters, possibly in algorithm.

So, what can I (an experienced software engineer but with no background in ML), do to make the two methods more comparable on my machine? Is it just a matter of tweaking a few parameters? Is Pong not a good choice to make the comparison?

I have an i7-3930k, a GTX 1060 (6 GB) and 32 GB of RAM.

Segmentation fault

./_train.sh: line 3: 3010 Segmentation fault (core dumped) python GA3C.py "$@"

Does anybody get a segmentation problem like this?

Training Slowdown

The issue is documented here, but I was wondering if you ever had any problems, receiving messages like this during training:

tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 1648707130 get requests, put_count=1648707127 evicted_count=2741000 eviction_rate=0.00166251 and unsatisfied allocation rate=0.00166258

I get this message quite often when cloning this repo and running, untouched, on pong. The issue seems worse when training on a custom pygame I made, and one time the training ground to a stop completely, with no more output to results.txt, and the console full of these messages.

If you have never had problems like this with your network, then I will close the issue. Otherwise, any advice would be greatly appreciated.

./_play.sh not working on OSx

After training a model successfully using ./_train.sh, I am attempting to run and render my game using this model on OSx with ./_play.sh. When I run this, I receive the errors:

The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.

After researching online, this seems to be an issue stemming from multiple processes attempting to render in parallel on OSx. However, the number of trainers, predictors, and agents are all 1. Additionally, I believe the error stems from the file ThreadDynamicAdjustment.py line 37 and 38.

I am on macOS High Sierra version 10.13.3.

Python 3.6.4 :: Anaconda, Inc.

# packages in environment at /Users/<User>/miniconda3/envs/missionplanner:
#
absl-py                   0.1.11                    <pip>
astor                     0.6.2                     <pip>
bleach                    1.5.0                     <pip>
ca-certificates           2017.08.26           ha1e5d58_0  
certifi                   2018.1.18                py36_0  
chardet                   3.0.4                     <pip>
cycler                    0.10.0                    <pip>
decorator                 4.2.1                     <pip>
future                    0.16.0                    <pip>
gast                      0.2.0                     <pip>
grpcio                    1.10.0                    <pip>
gym                       0.9.7                     <pip>
gym-cap                   0.2                       <pip>
html5lib                  0.9999999                 <pip>
idna                      2.6                       <pip>
kiwisolver                1.0.1                     <pip>
libcxx                    4.0.1                h579ed51_0  
libcxxabi                 4.0.1                hebd6815_0  
libedit                   3.1                  hb4e282d_0  
libffi                    3.2.1                h475c297_4  
Markdown                  2.6.11                    <pip>
matplotlib                2.2.0                     <pip>
ncurses                   6.0                  hd04f020_2  
networkx                  2.1                       <pip>
numpy                     1.14.0                    <pip>
openssl                   1.0.2n               hdbc3d79_0  
pandas                    0.22.0                    <pip>
Pillow                    5.0.0                     <pip>
pip                       9.0.1            py36h1555ced_4  
protobuf                  3.5.2                     <pip>
pygame                    1.9.3                     <pip>
pyglet                    1.4.0a1                   <pip>
pyparsing                 2.2.0                     <pip>
python                    3.6.4                hc167b69_1  
python-dateutil           2.6.1                     <pip>
pytz                      2018.3                    <pip>
PyWavelets                0.5.2                     <pip>
readline                  7.0                  hc1231fa_4  
requests                  2.18.4                    <pip>
scikit-image              0.13.1                    <pip>
scipy                     1.0.0                     <pip>
setuptools                38.4.0                   py36_0  
six                       1.11.0                    <pip>
sqlite                    3.22.0               h3efe00b_0  
tensorboard               1.6.0                     <pip>
tensorflow                1.6.0                     <pip>
termcolor                 1.1.0                     <pip>
tk                        8.6.7                h35a86e2_3  
urllib3                   1.22                      <pip>
Werkzeug                  0.14.1                    <pip>
wheel                     0.30.0           py36h5eb2c71_1  
xz                        5.2.3                h0278029_2  
zlib                      1.2.11               hf3cbc9b_2

Unnecessary relu applied to action probability logits

In the NetworkVP.py file on line 93, the activation function should be explicitly set to None. As it currently stands, the logits are being put through a relu non-linearity before softmax is applied.
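
The effect can be seen in a small, framework-free sketch. The logit values are made up for illustration. Applying relu first clamps every negative logit to zero, so the policy can no longer rank those actions against each other.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relu(values):
    return [max(0.0, v) for v in values]


logits = [-2.0, -1.0, 3.0]       # illustrative values
p_plain = softmax(logits)        # softmax on raw logits
p_relu = softmax(relu(logits))   # relu applied before softmax
```

With raw logits the first action is less likely than the second; after relu both negative logits become 0 and the two actions receive identical probability.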

should conduct padding before training?

Hello!
In file ThreadTrain.py
x__ = np.concatenate((x__, x_))

As x_ may be shorter than TMAX, should we pad it before concatenating?
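
As far as NumPy itself is concerned, `np.concatenate` along the first axis only requires the remaining dimensions to match, so arrays with different numbers of rows can be merged without padding. The shapes below are assumptions for illustration; whether the training code needs fixed-length segments for other reasons is a separate question.

```python
import numpy as np

# Hypothetical shapes: 5 states and a shorter run of 2 states,
# each state being an 84x84 image with 4 stacked frames.
x_long = np.zeros((5, 84, 84, 4), dtype=np.float32)
x_short = np.ones((2, 84, 84, 4), dtype=np.float32)

# Default axis=0: rows are stacked, trailing dimensions must agree.
merged = np.concatenate((x_long, x_short))
```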

LSTM version

This is great work. Is there any plan to develop an LSTM version?

Suggested Config.py settings for a DGX-1

After running the _train.sh with the default Config.py on a DGX-1 for about an hour I see that the CPU usage stays pretty constant at about 15%, and one GPU is being used at about 40%.

The settings in Config.py are unchanged: DYNAMIC_SETTINGS = True. The number of trainers varies between 2 and 6, the number of predictors varies between 1 and 2, and the number of agents varies from 34 to 39. I would have expected them to grow to use the available CPU resources.

Are there settings that will better leverage the cores on a DGX-1?
It looks like the code in NetworkVP.py is written for a single GPU. With TensorFlow's support for multiple GPUs, do you have plans to add it? On the surface it seems pretty easy to add:

for d in ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']:    
    with tf.device(d):
       .... calcs here...

GA3C source code has High CPU usage causing System freeze or crash

The code runs fine but leaks CPU and memory and will crash your system. I am using the Glances monitoring tool (pip install glances). If you leave the code running for a long time, the number of CPU context switches increases substantially, and CPU and memory usage keep increasing until the code hangs or crashes. CPU usage increased from 6.7% to 64% and memory from 10% to 79%, at which point the system froze. When I look at the NVIDIA TITAN X (Maxwell, 12 GB memory) usage, it is only using about 300 MB out of 12 GB. So while most of the heavy lifting should be offloaded to the GPU, that does not seem to be the case here. I have 8 x TITAN Maxwell GPUs with 2 x Intel Xeon 2660 v3 (2 CPUs with a total of 40 CPU cores) and 128 GB of DDR4 memory, and I can use any of them. Still, I get the same results: CPU usage keeps increasing.

Any insights?

Other implementations, both the original A3C and various hybrid (CPU & GPU) versions, seem to offload most of the heavy lifting to the GPU and cause no system freezes, but GA3C does not.

Testing it on various amounts of data and games

Why does ProcessAgent use Process while ThreadTrainer uses Thread?

Hello!

Why does ProcessAgent use Process while ThreadTrainer uses Thread? I wonder whether ProcessAgent.py could use Thread instead of Process.

Different network layer structure from DQN

Looking at the NetworkVP.py code, I see that it has 2 convolutional layers and a dense layer, and the comment above says # As implemented in A3C paper.

But the Asynchronous Methods for Deep Reinforcement Learning paper does not describe the structure of the network layers. So I assumed that A3C uses the same model structure as DQN, since the paper compares A3C with DQN.

DQN has 3 convolutional layers and a dense layer, but the GA3C code does not.

Action Repeat Option

I've just finished reading the GA3C paper, and I didn't notice any mention of action repeat (as implemented in original Deep-Q paper, and A3C paper). From my understanding, action repeat means that, not only are frames stacked together (into an effective single state) in groups of 4, but actions are also only selected every 4 frames, with these action selections then being carried through (i.e. repeated) until another 4 frames have passed, ready for another action selection.

From my understanding, these correspond to two different hyperparameters: one for how many frames to stack (referred to as "m" in the Deep-Q methods section), and one for how often to select actions (referred to as "k" in the Deep-Q methods section). These are also explicitly referenced in the A3C Experimental Setup section.

I may be wrong, but I've gone through this source code, and from my understanding, a different action is selected every single frame, not every 4th frame (or kth frame, to be more general). It seems as though there is no option to change this, to more closely replicate DeepMind's A3C approach. I was wondering if this was a design decision? If not, is there any plan to incorporate an action repeat option into the model, with a parameter in the Config.py file etc.?

Alternatively, if I've misinterpreted the code (which is entirely possible), I would be grateful if you could point out where/how action repeat is incorporated.

I will try implementing it efficiently myself in the meantime, which you are welcome to use if helpful.

P.s. this code is proving to be extremely useful!
Thanks a lot for sharing!

Dan
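
A minimal sketch of action repeat as described above: one selected action is applied for k consecutive frames and the rewards are summed. `CountingEnv` and `step_with_repeat` are hypothetical names, and the toy environment returns a simplified `(observation, reward, done)` tuple rather than the Gym interface used by GA3C.

```python
class CountingEnv:
    """Toy environment: reward 1.0 per frame, episode ends after `length` frames."""

    def __init__(self, length):
        self.length = length
        self.frames = 0

    def step(self, action):
        self.frames += 1
        done = self.frames >= self.length
        return self.frames, 1.0, done


def step_with_repeat(env, action, k):
    """Apply `action` for up to k frames, summing rewards and stopping early on done."""
    total_reward = 0.0
    observation, done = None, False
    for _ in range(k):
        observation, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return observation, total_reward, done


env = CountingEnv(length=6)
first = step_with_repeat(env, action=0, k=4)
second = step_with_repeat(env, action=0, k=4)
```

The first call consumes 4 frames; the second stops after 2 frames because the episode ends. The early exit matters so that no frames are stepped after termination.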

Playing hangs after last episode

Hi,
I am running testing (playing) for 500 iterations, and then I want to automatically start another job. But I cannot do that, because 2 of the 3 GA3C jobs do not end and need to be killed by .

The reason seems to be that the predictor gets stuck waiting for an agent that has already terminated (see the end of Server.py / ThreadPredictor.py). Could you please look at that?

Thanks,
Ernst

The training process does not exit and has to be killed by Ctrl+C

Known issue. Adding it to keep track of it.

TRAINING_MIN_BATCH_SIZE does not seem to affect anything

In ThreadTrainer.py, I don't understand how the following lines are supposed to affect the batch size :

np.concatenate((x__, x_))
np.concatenate((r__, r_))
np.concatenate((a__, a_))

np.concatenate returns the merged array, but does not affect x__ or x_.

However, I do measure a drop in TPS. What sorcery is this?

[Time: 404] [Episode: 213 Score: -1.0642] [RScore: 7.5345 RPPS: 281] [PPS: 282 TPS: 4] [NT: 2 NP: 3 NA: 4]

(The PPS/ TPS is overall low in my case because the game is a costly one running on remote desktop)

EDIT: I suggest changing this to:

x__ = np.concatenate((x__, x_))
r__ = np.concatenate((r__, r_))
a__ = np.concatenate((a__, a_))

but this does not affect TPS compared to the original lines.
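
The NumPy behaviour in question can be checked in isolation. The sketch below uses a hypothetical `gather_batch` helper, not the ThreadTrainer code, and made-up chunk sizes; it shows that a batch only grows when the result of `np.concatenate` is assigned back.

```python
import numpy as np


def gather_batch(chunks, min_batch_size):
    """Merge chunks until the batch holds at least min_batch_size rows."""
    chunks = iter(chunks)
    batch = next(chunks)
    while batch.shape[0] < min_batch_size:
        try:
            # The returned array must be assigned; concatenate never works in place.
            batch = np.concatenate((batch, next(chunks)))
        except StopIteration:
            break  # no more data available
    return batch


batch = np.array([1, 2, 3])
np.concatenate((batch, np.array([4, 5])))  # result discarded, batch unchanged
size_without_assignment = batch.shape[0]

merged = gather_batch([np.arange(3), np.arange(2), np.arange(4)], min_batch_size=5)
```

Without the assignment the batch keeps its original 3 rows; with it, the helper stops as soon as the minimum size of 5 is reached.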

Incompatibility with the most recent releases of OpenAI Gym

Because of the use of the deprecated gym.undo_logger_setup() method (openai/gym@4c460ba), GA3C no longer works with the most recent versions. Using this method is no longer required as OpenAI Gym no longer modifies the global logging configuration.

I've submitted a pull request (#42) to remove the use of the deprecated method.
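
For code that has to run against both older and newer Gym releases, one alternative to removing the call is to guard it. This is only a sketch and is not necessarily what the pull request does. `undo_logger_setup_if_present` is a hypothetical helper, and the `SimpleNamespace` objects stand in for the real `gym` module so the example runs without Gym installed.

```python
from types import SimpleNamespace


def undo_logger_setup_if_present(gym_module):
    """Call gym_module.undo_logger_setup() only when that attribute exists."""
    setup = getattr(gym_module, "undo_logger_setup", None)
    if callable(setup):
        setup()
        return True
    return False


# Stand-ins for an old Gym (has the function) and a new Gym (does not).
calls = []
old_gym = SimpleNamespace(undo_logger_setup=lambda: calls.append("called"))
new_gym = SimpleNamespace()

old_result = undo_logger_setup_if_present(old_gym)
new_result = undo_logger_setup_if_present(new_gym)
```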

Cannot learn problems with a single, terminal reward

Thank you for the easy to use and fast A3C implementation. I created a simple problem for rapid testing that rewards 0 on all steps except the terminal step, where it rewards either -1 or 1. GA3C cannot learn this problem because of line 107 in ProcessAgent.py:

terminal_reward = 0 if done else value

which causes the agent to ignore the only meaningful reward in this environment, and line 63 in ProcessAgent.py:

return experiences[:-1]

which causes the agent to ignore the only meaningful experience in this environment.

This is easily fixed by changing line 107 in ProcessAgent.py to

terminal_reward = reward if done else value

and _accumulate_rewards() in ProcessAgent.py to return all experiences if the agent has taken a terminal step. These changes should generally increase performance as terminal steps often contain valuable reward signal.
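
Why the terminal step matters can be seen from a generic discounted-return computation. This is a sketch of the standard formulation, in which the terminal reward is part of the reward list and the bootstrap value is 0 at episode end. It is not the repository's `_accumulate_rewards()`, and the discount factor is chosen only to keep the arithmetic exact.

```python
def discounted_returns(rewards, bootstrap_value, gamma):
    """Compute R_t = r_t + gamma * R_{t+1}, seeded with bootstrap_value."""
    running = bootstrap_value
    returns = []
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Episode with a single terminal reward of +1; no bootstrap since it ended.
rewards = [0.0, 0.0, 1.0]
with_terminal = discounted_returns(rewards, bootstrap_value=0.0, gamma=0.5)

# Same episode if the terminal step (and its reward) is dropped.
without_terminal = discounted_returns(rewards[:-1], bootstrap_value=0.0, gamma=0.5)
```

With the terminal step included, every earlier step receives a discounted share of the +1; once it is dropped, all returns are zero and there is nothing left to learn from.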

Release date

Hi!
Thanks for a great article! I'm interested in it and want to try some ideas.
When are you planning to release the framework? Can you provide an estimated release date?

test with Pong-v0 : not converging ?

I am running a training session right now. I replaced PongDeterministic-v0 with Pong-v0 (the former does not seem to exist in my install); other than that, everything is the same in Config.py and the other files.

After 2 hours :

[Time: 7795] [Episode: 7923 Score: -20.0000] [RScore: -20.2960 RPPS: 1488] [PPS: 1499 TPS: 251] [NT: 5 NP: 4 NA: 28]

Am I missing something here? Is there a need to modify Config.py?

EDIT : needed to update gym; retried with PongDeterministic-v0:

Here it is with Learning Rate = 1e-3 and the game PongDeterministic-v0

[Time: 4177] [Episode: 4397 Score: -9.0000] [RScore: -10.5960 RPPS: 1645] [PPS: 1646 TPS: 278] [NT: 4 NP: 4 NA: 33]

Any idea of the difference between the 2 games ?

Cannot see the agents in action during testing

When I run the _play.sh script, I can't see the agent in action.

There is an unused variable

In this code.

yield None, None, None, total_reward

total_reward variable is not used in that code.

why do we need x as argument of train() ?

In NetworkVP, ThreadTrainer, we see that we call train_model with x, r, a (states, rewards, actions)

Wouldn't it be simpler to just maintain a history of p, v for each agent number inside NetworkVP, and compute the loss and backprop in the train function, using agent indices, when the rewards come back?

Thus we would avoid recomputing forward (already been done in predict), possibly even accelerating the whole process?

Issues with learning in custom environment

Hello,

I'm writing you to discuss a problem that is not directly related with your code and application but is affecting my own efforts to apply RL on a more custom type of problem where we have an environment which is not atari-like.

In my custom environment I have noisy, difficult-to-interpret images as my observations, and I can take a number of actions. For each image, between 1 and 4 actions can be considered correct and between 5 and 8 actions are always incorrect.

This problem can be also formulated in a fully supervised manner, as a classification problem, where we ignore the fact that more than one action can be considered correct at a time and that these actions are related to each other over time and define a "trajectory" of actions.
When we use the supervised approach in the way i just described the system works well, meaning that there is no struggle interpreting those noisy images that are difficult to understand by humans.

When these images are organized in a structured manner, in an environment one can play with, it's possible to use a RL algorithm to solve the problem. We have tried with satisfactory results DQN, which works okay. In that case the reward signal is provided continuously, for an action that goes in the right direction we assign a reward of 0.05, for an action that goes in the wrong one -0.15, for a "done" action issued correctly +1 and for a "done" actions issued incorrectly -0.25. A "done" action doesn't terminate the episode (it is terminated after 50 steps). DQN in these settings very slowly converges and shows nice validation results.

When we employ A3C, the behavior is one of the following:

1. The reward goes up for a bit, then stabilizes at a very poor value and never moves again (we also tried dampening beta as we optimize and got identical behaviour).
2. The reward fluctuates and then drifts so that a correct action is NEVER taken during the 50 steps of the episode, meaning the -0.15 reward is obtained at every step, as if the network had learned perfectly how not to do things.

I am very puzzled by this behavior. I have checked and re-checked every moving piece. the environment, the inputs to the network in terms of images, the rewards, the distribution of different actions over time (such that i see if the network is just learning to issue always the same action for some reason). All these problems seem to be absent. I thought it was a problem of exploration vs. exploitation and therefore i reduced first and then increased beta. i have also dampened beta over time to see what happens, but the most i obtained was a sinusoidal kind of behavior of the reward.

I also have tried to use a epsilon greedy strategy (just for debug purposes) instead of sampling from the policy distribution, with no success (network converges to the worst possible scenario rather quickly).

I tried reducing learning rate with no success.

Now, the policy gradient loss is not exactly the same as the cross-entropy loss but it resembles it quite a bit. With an epsilon greedy policy i would expect that for each image (we have a limited number of images (observations) that are re-proposed when the environment reaches a similar state) all the possible actions are actually explored and therefore the policy is learned in a way that is not so far away from the supervised case. If i set discount factor to zero (which i have tried), the value part of the network does not really play a role (i might be mistaken though) and if i give a reward for each step i take i should kind of converge to something that resembles my classification approach that i described above.

Maybe the fact that multiple actions can generate the same reward or penalty is the problem ?!

I would immensely appreciate any help or thoughts from you. Although I'm really motivated to apply RL to my specific problem, I really don't know what to do to improve the situation.

Thanks a lot,

Fausto

Wrong A3C implementation

I believe there is a bug in the A3C algorithm implementation. In the file "ProcessAgent.py" on line 107. The sub-episode return should be the value in the next state not the previous.

I suggest replacing:

    prediction, value = self.predict(self.env.current_state)
    ...
    if done or time_count == Config.TIME_MAX:
        terminal_reward = 0 if done else value

with:

    prediction, value = self.predict(self.env.current_state)
    ...
    if done or time_count == Config.TIME_MAX:
        terminal_reward = 0
        if not done:
            (_, terminal_reward) = self.predict(self.env.current_state)

Why is pytorch-a3c implementation so much faster?

https://github.com/ikostrikov/pytorch-a3c has an implementation (CPU ONLY) that can converge PongDeterministic-v3 within 15 minutes while the GPU powered GA3C appears to take 2-3 hours to achieve the same?

Based on my (limited) comparison they are using ADAM instead of RMSProp and using PongDeterministic-v3 instead of PongDeterministic-v0.

Maybe there is an incredible amount of overhead pushing data to the GPU so only with large models would see a true speedup?

Is trainer.enabled unused?

In Server.py, line 108 sets trainer.enabled = False. Could this be useless? Here 'trainer' is an instance of the class 'ThreadTrainer'.
It does not appear to be used anywhere.
The 'enabled' property only appears in the class ThreadDynamicAdjustment; I cannot find it in the class 'ThreadTrainer'.

memory usage growth after a while

I have tested the code on a GTX 1080 with 32 GB of RAM. When I run the code, memory usage increases over time, and after about 30 hours it takes all 32 GB of RAM and causes the system to die.
