
miyosuda / async_deep_reinforce

586 stars · 49 watchers · 194 forks · 306 KB

Asynchronous Methods for Deep Reinforcement Learning

License: Apache License 2.0

Python 100.00%
tensorflow reinforcement-learning a3c deep-learning

async_deep_reinforce's People

Contributors

itsukara, miyosuda, wsjeon


async_deep_reinforce's Issues

Problem with compiling the code

Hi, I am trying to compile and run your code, but I'm stuck at the step: cmake -DUSE_SDL=ON -DUSE_RLGLUE=OFF -DBUILD_EXAMPLES=OFF .

Because I am just starting to learn RL and Python, it is a little difficult for me to fix the error myself. I would appreciate it if you could help me.

The attached files are the log files. Thank you!
CMakeError (copy).txt
CMakeOutput (copy).txt

Error occurs at recent TensorFlow 0.12

Hi, I got the following error message. I am using TF 0.12.

File "/home/itl/async_deep_reinforce/a3c_training_thread.py", line 42, in init
var_refs = [v.ref() for v in self.local_network.get_vars()]
AttributeError: 'Variable' object has no attribute 'ref'

I found that 'ref' was changed to 'read_value' in the recent updates (http://stackoverflow.com/questions/40901391/what-is-the-alternative-of-tf-variable-ref-in-tensorflow-version-0-12).
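
A minimal compatibility sketch based on that change (read_value() is the documented TF 0.12 replacement for ref(), though it is not guaranteed to behave identically when the refs are later passed to tf.gradients; the helper name is illustrative):

  import tensorflow as tf

  def variable_refs(variables):
      # TF <= 0.11 exposes Variable.ref(); TF >= 0.12 removed it in favour of read_value()
      if hasattr(tf.Variable, 'ref'):
          return [v.ref() for v in variables]
      return [v.read_value() for v in variables]

  # inside A3CTrainingThread.__init__:
  # var_refs = variable_refs(self.local_network.get_vars())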

Thank you.


After addressing the above issue, I got another error.

WARNING:tensorflow:tf.op_scope(values, name, default_name) is deprecated, use tf.name_scope(name, default_name, values)
Traceback (most recent call last):
  File "a3c.py", line 70, in <module>
    device = device)
  File "/home/itl/async_deep_reinforce/a3c_training_thread.py", line 53, in __init__
    self.gradients )
  File "/home/itl/async_deep_reinforce/rmsprop_applier.py", line 103, in apply_gradients
    clipped_accum_grad = tf.clip_by_norm(accum_grad, self._clip_norm)
  File "/home/itl/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/clip_ops.py", line 96, in clip_by_norm
    t = ops.convert_to_tensor(t, name="t")
  File "/home/itl/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 669, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/itl/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/constant_op.py", line 176, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/home/itl/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/constant_op.py", line 165, in constant
    tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
  File "/home/itl/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/tensor_util.py", line 360, in make_tensor_proto
    raise ValueError("None values not supported.")
ValueError: None values not supported.

I think something is wrong in the RMSProp applier with the latest TF.
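
A hedged debugging sketch (not the repository's code; variable names are illustrative): the ValueError comes from tf.clip_by_norm being handed a None gradient, so guarding the loop shows which variable received no gradient. If gradients come back as None here, the refs passed to tf.gradients are the likely culprit (e.g. read_value() snapshots are not on the gradient path; passing the variables themselves usually avoids that on newer TF):

  clipped_accum_grads = []
  for var, accum_grad in zip(var_list, accum_grads):
      if accum_grad is None:
          print("no gradient accumulated for %s" % var.name)
          clipped_accum_grads.append(None)   # keep the list aligned with var_list
      else:
          clipped_accum_grads.append(tf.clip_by_norm(accum_grad, clip_norm))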

Large action space

Is this implementation suitable for large action spaces, like thousands of possible actions?
Also, can it be modified to work on the problem I am trying to solve? I would like to use this implementation with modified state and action spaces.

Training speed (in hours)

I am trying to get a sense of how long this implementation takes to achieve a good average score on Pong. The DeepMind paper gets to an average of +20 in about 4 hours of training. How does this implementation fare?

How do I run inference?

Because I want to see the performance at inference time, how do I run inference instead of training?
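
A hedged sketch of an evaluation loop with no gradient updates, assuming the GameState and network interfaces from this repository (a3c_display.py in the repo does essentially this; the names below are illustrative):

  import numpy as np

  def play_one_episode(sess, network, game_state):
      total_reward = 0.0
      while True:
          pi = network.run_policy(sess, game_state.s_t)            # action probabilities
          pi = pi / pi.sum()                                        # guard against float32 rounding
          action = np.random.choice(len(pi), p=pi)                  # sample (or np.argmax(pi) for greedy play)
          game_state.process(action)
          total_reward += game_state.reward
          if game_state.terminal:
              break
          game_state.update()
      return total_reward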

Update algorithm does not work correctly?

Hi,

First I have to say: great work! Without your repository I wouldn't have been able to get first results so quickly =)

I took your algorithm together with your maze implementation. Unfortunately I had to change some things, since the maze code is a bit older and does not include all your newer changes. After getting rid of all errors,
I wasn't able to train anything for a long time, until I realized that a negative reward of <= -1 does not work at all.

What I did to verify this: I initialized the network randomly and forced a single agent to always run directly into the wall during training. What I expected was that the probability of every action would go to 0.33, except for the one that leads the agent directly into the wall, which should go to 0.
But I observed a different behavior: the action that leads the agent into the wall converges to zero as expected, but one of the other actions is pushed towards one and the remaining two go to zero as well.
As far as I can tell, it is always the action with the highest starting probability right after initialization that gets emphasized during this training process.
I can't imagine that this is the intended behavior, or am I wrong?
When I clip the reward to a minimum of -0.9 it works fine!
My agent then finds a path with minimal cost, even if the learning diverges again after several thousand steps.
I guess this could also be an issue addressing the same problem.
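
A minimal sketch of the workaround described above (names are illustrative): clip the per-step reward to the range (-0.9, 1.0) instead of (-1.0, 1.0) before accumulating the return.

  import numpy as np

  clipped_reward = np.clip(raw_reward, -0.9, 1.0)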

Regards, Martin

Build environment

I am using Windows. After entering "$ make -j 4", cmd returns an error saying it can't recognize 'make'. What can I do? I would like to test the code, gather some data, and plot graphs like the ones you show, in order to learn about Asynchronous Methods for Deep Reinforcement Learning. If you could roughly describe how to train and how to analyse the results, I would be very grateful.

thank you!
my email: [email protected]

A3C-LSTM and DRQN

Dear

I wonder if the A3C-LSTM here essentially implements DRQN, but on top of A3C instead of DQN (https://arxiv.org/pdf/1507.06527.pdf).

If not, then could you please share the details/reference?

Finally, do you think using a GRU instead of an LSTM would help with the following? (A minimal sketch of the swap is given after this list.)

  • Reducing the number of parameters to train
  • Speeding up training
  • Keeping almost the same accuracy
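
A hedged sketch of the cell swap (TF 1.x contrib RNN API; the exact module path moved between TF releases, and the repository may use a different one):

  import tensorflow as tf

  # the 256-unit LSTM cell used in game_ac_network.py could be replaced by a GRU cell,
  # which has no separate cell state and roughly 3/4 of the parameters for the same size
  lstm_cell = tf.contrib.rnn.BasicLSTMCell(256, state_is_tuple=True)
  gru_cell  = tf.contrib.rnn.GRUCell(256)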

Thank you so much

Problem while using the code

Hello @miyosuda,

Thanks for sharing the code. Please ignore the title: I tried your code on the cart-pole balancing control problem instead of an Atari game, and it works well. But I have a few questions.

I am curious: in the asynchronous paper they also use another model with one linear layer, one LSTM layer, and a softmax output. I am thinking of using that model to see whether it improves the result. Can you suggest how the LSTM could be implemented in TensorFlow for playing Atari games?

I am also wondering about the accumulated states and rewards being reversed; do the actions and values need to be reversed as well? It did not make any difference when I tried it, I am just wondering why.

states.reverse()
rewards.reverse()
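
A hedged sketch of the n-step return computation that makes the reversal natural (names are illustrative, GAMMA is the discount factor): iterating the rollout backwards lets R accumulate as R = r_t + GAMMA * R, so only the lists consumed in that backwards pass need to be reversed.

  GAMMA = 0.99
  R = 0.0 if terminal else bootstrap_value   # V(s_{t+n}) from the value head if the rollout did not end
  batch_R = []
  for r in reversed(rewards):                # same effect as rewards.reverse() plus a forward loop
      R = r + GAMMA * R
      batch_R.append(R)
  # batch_R[i] now lines up with the reversed states/actions lists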

Last, do you really need to accumulate the gradients and then apply the update, since TensorFlow can handle the batch for the update?

Append state issue

self.s_t1 = np.append(x_t1,self.s_t[:,:,0:3], axis = 2) is wrong
it should be
self.s_t1 = np.append(self.s_t[:,:,1:],x_t1, axis = 2)
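
A minimal illustration of the ordering difference, assuming 84x84 frames stacked along the last axis (shapes are illustrative):

  import numpy as np

  s_t  = np.zeros((84, 84, 4))   # current state: 4 stacked frames, oldest first
  x_t1 = np.ones((84, 84, 1))    # newest preprocessed frame
  # drop the oldest frame and append the newest at the end,
  # so the channel order stays chronological (oldest ... newest)
  s_t1 = np.append(s_t[:, :, 1:], x_t1, axis=2)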

Variable net_-1/BasicLSTMCell/Linear/Matrix does not exist, or was not created with tf.get_variable().

Hi @miyosuda,
Thank you for sharing the code.
When I tried to run the code, I ran into a problem.
'''
Traceback (most recent call last):
  File "a3c.py", line 50, in <module>
    global_network = GameACLSTMNetwork(ACTION_SIZE, -1, device)
  File "/home/lab/sk/a3c-new/async_deep_reinforce/game_ac_network.py", line 218, in __init__
    self.W_lstm = tf.get_variable("BasicLSTMCell/Linear/Matrix")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 988, in get_variable
    custom_getter=custom_getter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 890, in get_variable
    custom_getter=custom_getter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 348, in get_variable
    validate_shape=validate_shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 333, in _true_getter
    caching_device=caching_device, validate_shape=validate_shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 657, in get_single_variable
    "VarScope?" % name)
ValueError: Variable net_-1/BasicLSTMCell/Linear/Matrix does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope?
'''
I have TF 1.0 installed and I am having trouble debugging this. I would appreciate it very much if you could help me.
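
A hedged debugging sketch: printing the variables that have actually been created, just before the failing tf.get_variable() call, shows what the LSTM weight variable is named under the installed TF version (it was renamed across releases, e.g. "BasicLSTMCell/Linear/Matrix", then "basic_lstm_cell/weights", then "basic_lstm_cell/kernel"):

  import tensorflow as tf

  for v in tf.global_variables():
      print(v.name)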

Any reason for choosing ACTION_SIZE = 3? Extension for continuous action?

I cannot figure out why you chose ACTION_SIZE = 3.

In your 'gym' branch, I see that someone changed it to 6.

As I remember, in the DQN paper (Nature) the action size is greater than 10 for some games, and I think in some cases this can affect performance.

Any reason?

Also, do you have plans to extend your work to the continuous action domain?

p.s.

I think your code is awesome! :)

About variable error.

Hello.
When I ran "a3c.py" or "a3c_display.py" , I got error log bellow.


I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
Traceback (most recent call last):
  File "a3c.py", line 50, in <module>
    global_network = GameACLSTMNetwork(ACTION_SIZE, -1, device)
  File "/home/uchi/async_deep_reinforce/game_ac_network.py", line 218, in __init__
    self.W_lstm = tf.get_variable("BasicLSTMCell/Linear/Matrix")
  File "/home/uchi/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 988, in get_variable
    custom_getter=custom_getter)
  File "/home/uchi/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 890, in get_variable
    custom_getter=custom_getter)
  File "/home/uchi/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 348, in get_variable
    validate_shape=validate_shape)
  File "/home/uchi/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 333, in _true_getter
    caching_device=caching_device, validate_shape=validate_shape)
  File "/home/uchi/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 657, in get_single_variable
    "VarScope?" % name)
ValueError: Variable net_-1/BasicLSTMCell/Linear/Matrix does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope?


How do I resolve this problem?

Failing to fully replicate Pong with A3C-LSTM

@miyosuda I have been trying to replicate your very nice Pong A3C-LSTM chart (https://github.com/miyosuda/async_deep_reinforce/blob/master/docs/graph_24h_lstm.png). So far, unfortunately, I have not really succeeded.

I have been using the parameter settings in constants.py (setting USE_LSTM=True and USE_GPU=True). I have also been setting frame_skip=4 in ale.cfg (using the master branch, tf r0.10, [email protected], 8 threads, Nvidia Titan X).

My question: should the above setting of parameters be sufficient to reproduce your A3C-LSTM chart?

When doing the above, my charts look as follows (I have done multiple runs, also simulating for more than 60M steps):
[chart: 160920-a3c_lstm_pong_miyosuda_master5]
In my case learning seems to be much slower, saturating at around -5 to 0 (better seen in my simulations running for more than 60M steps).

Comparing with Figure 3 of http://arxiv.org/abs/1602.01783 (Pong, A3C, 8 threads), it seems your result is faster in terms of the number of steps needed to reach a score of 20: you require ~20-25M steps, whereas DeepMind on average seemed to need ~50-60M (though theirs is an average value, and from the paper I don't know whether it was A3C-FF or A3C-LSTM).

My second question: have you run A3C-FF / A3C-LSTM multiple times, and were the results similar? And do you have an explanation for why A3C-FF is not reaching a score of 20? (I have also run A3C-FF, which looks similar to yours...)

Many many thanks for the code!!

Incorrect policy loss

Hi, I'm reading this repository to implement my own A3C, and I found the policy loss to be incorrect.

The current policy loss is at https://github.com/miyosuda/async_deep_reinforce/blob/master/game_ac_network.py#L31.

# policy entropy
entropy = -tf.reduce_sum(self.pi * log_pi, reduction_indices=1)

# policy loss (output)  (Adding minus, because the original paper's objective function is for gradient ascent, but we use gradient descent optimizer.)
policy_loss = - tf.reduce_sum( tf.reduce_sum( tf.multiply( log_pi, self.a ), reduction_indices=1 ) * self.td + entropy * entropy_beta )

So it's like policy_loss = -log(pi) * a + beta * entropy. In this case, entropy would be minimized. However, entropy should be maximized to encourage exploration and avoid premature convergence. The original paper says:

[image: excerpt from the original paper]

Thus, the correct policy loss should be policy_loss = -log(pi) * a - beta * entropy.

If I am wrong, please just close this issue. I hope this helps improve this implementation.

@miyosuda

ValueError: Variable net_-1/basic_lstm_cell/weights does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope?

Environment:
Python 3.5.2
TensorFlow 1.2.1

Traceback (most recent call last):
  File "a3c.py", line 70, in <module>
    device = device)
  File "/home/jack/Applications/async_deep_reinforce/a3c_training_thread.py", line 51, in __init__
    self.gradients )
  File "/home/jack/Applications/async_deep_reinforce/rmsprop_applier.py", line 103, in apply_gradients
    clipped_accum_grad = tf.clip_by_norm(accum_grad, self._clip_norm)
  File "/home/jack/python3/venv/lib/python3.5/site-packages/tensorflow/python/ops/clip_ops.py", line 107, in clip_by_norm
    t = ops.convert_to_tensor(t, name="t")
  File "/home/jack/python3/venv/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 676, in convert_to_tensor
    as_ref=False)
  File "/home/jack/python3/venv/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 741, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/jack/python3/venv/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 113, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/home/jack/python3/venv/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 102, in constant
    tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
  File "/home/jack/python3/venv/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 364, in make_tensor_proto
    raise ValueError("None values not supported.")
ValueError: None values not supported.
(venv) jack@jack:~/Applications/async_deep_reinforce$ python a3c.py 
Traceback (most recent call last):
  File "a3c.py", line 50, in <module>
    global_network = GameACLSTMNetwork(ACTION_SIZE, -1, device)
  File "/home/jack/Applications/async_deep_reinforce/game_ac_network.py", line 218, in __init__
    self.W_lstm = tf.get_variable("basic_lstm_cell/weights")
  File "/home/jack/python3/venv/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 1065, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/home/jack/python3/venv/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 962, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/home/jack/python3/venv/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 367, in get_variable
    validate_shape=validate_shape, use_resource=use_resource)
  File "/home/jack/python3/venv/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 352, in _true_getter
    use_resource=use_resource)
  File "/home/jack/python3/venv/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 682, in _get_single_variable
    "VarScope?" % name)
ValueError: Variable net_-1/basic_lstm_cell/weights does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope?

A3C-FF seems not to work well?

Hi @miyosuda, thanks for providing the code! When I tried it on games other than Pong (only the ROM name and ACTION_SIZE were modified), I found that A3C-FF does not seem to work very well. For example, after 50M iterations the training score for breakout is ~30, while that for space_invaders is ~600, both lower than what is reported in the A3C paper.

Also, I found videos for breakout and space_invaders in @Itsukara's fork. Could you, @Itsukara, share your training details for these two games with this code?

RMSProp and use_locking = False

The TensorFlow documentation says:

use_locking: If True, updating of the var, ms, and mom tensors is protected by a lock; otherwise the behavior is undefined, but may exhibit less contention.

However, in this code the flag is set to False. Could this cause a problem via a race condition?

Also, I don't understand why the original paper states it is better to share g across threads. Is there any reason to justify this other than empirical evidence?

Would like to convert this to Keras but having issues

Would like to convert this to Keras but am having issues... has anyone tried this?


  # (module-level imports this snippet needs, Keras 1.x API:)
  # import tensorflow as tf
  # from keras import backend as K
  # from keras.layers import Input, Convolution2D, Flatten, Dense
  # from keras.models import Model

  # inside __init__:
  with tf.device(self._device):
    # weight initializer; alternatives: 'uniform', 'glorot_normal', 'glorot_uniform',
    # 'lecun_uniform', or lambda shape, name: normal(shape, scale=0.01, name=name)
    init = 'lecun_uniform'

    K.set_image_dim_ordering('tf')
    self.s = Input(shape=(NETWORK_IMAGE_HIEGHT, NETWORK_IMAGE_WIDTH, NETWORK_IMAGE_CHANNELS))

    K.set_learning_phase(learning_phase)

    # Convolution2D args: number of filters, filter rows, filter cols; subsample is the stride
    shared = Convolution2D(16, 8, 8, name="conv1", subsample=(4, 4), activation='relu',
                           border_mode='same', init=init)(self.s)
    shared = Convolution2D(32, 4, 4, name="conv2", subsample=(2, 2), activation='relu',
                           border_mode='same', init=init)(shared)
    shared = Flatten()(shared)
    shared = Dense(name="h1", output_dim=256, activation='relu', init=init)(shared)

    # policy (softmax over actions) and value (scalar) heads on top of the shared trunk
    self.pi = Dense(name="p", output_dim=action_size, activation='softmax', init=init)(shared)
    self.v  = Dense(name="v", output_dim=1, activation='linear', init=init)(shared)

    self.policy_network = Model(input=self.s, output=self.pi)
    self.value_network  = Model(input=self.s, output=self.v)

    self.p_params = self.policy_network.trainable_weights
    self.v_params = self.value_network.trainable_weights

    self.p_out = self.policy_network(self.s)
    self.v_    = self.value_network(self.s)
    self.v_out = tf.reshape(self.v_, [-1])

  def run_policy_and_value(self, sess, s_t):
    pi_out, v_out = sess.run([self.p_out, self.v_out], feed_dict={self.s: [s_t]})
    return (pi_out[0], v_out[0])

  def run_policy(self, sess, s_t):
    probs = self.p_out.eval(session=sess, feed_dict={self.s: [s_t]})[0]
    return probs

  def run_value(self, sess, s_t):
    values = self.v_out.eval(session=sess, feed_dict={self.s: [s_t]})[0]
    return values

  def get_vars(self):
    # note: p_params and v_params share the conv/dense trunk; only v_params is returned here
    #return self.p_params
    return self.v_params

how to test the model?

Hi, I am new to TensorFlow and interested in running this project,
but I don't see testing instructions in your README or wiki. Could you please describe how to test the model?
Thanks a lot.

Scores are not averaged using global network unlike the original paper

Dear

I would like to thank you for this great work.

The README mentions:

"Scores are not averaged using global network unlike the original paper."

However, I see "global_network" used in a3c.py and a3c_display.py.

Could you please explain how it works?

Thank you

code does not work for breakout

Hello, when I train on the game Breakout something seems to be wrong: after 16M steps the score has only reached about 50, but training worked well for Pong and Space Invaders. Does Breakout need some specific implementation?
Thank you.

Stock Trading Game using n-Step Q-Learning A3C FF ( Algorithm 2 ) and A3C LSTM ( Algorithm 3)

I have some code for a stock trading game that currently uses standard deep Q-learning with experience replay, but I would like to use A3C-LSTM with experience replay, as per the research paper, in Keras + TensorFlow. Let me know if you are interested in working to incorporate the stock trading game into your code (I will email you the zip; it is 6 small Python files).

Hogwild?

Does this A3C implementation use Hogwild-style updates? If it does not, is there any reason why that shouldn't affect the results compared to the original A3C paper, which uses Hogwild?

Build problems with cmake -DUSE_SDL=ON -DUSE_RLGLUE=OFF -DBUILD_EXAMPLES=OFF .

Hi, when I build the project:
$ git clone https://github.com/miyosuda/Arcade-Learning-Environment.git
$ cd Arcade-Learning-Environment
$ cmake -DUSE_SDL=ON -DUSE_RLGLUE=OFF -DBUILD_EXAMPLES=OFF .
$ make -j 4

an error occurred at the third step:

-- The C compiler identification is AppleClang 8.0.0.8000042
-- The CXX compiler identification is AppleClang 8.0.0.8000042
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Could NOT find SDL (missing: SDL_LIBRARY SDL_INCLUDE_DIR)
CMake Error at CMakeLists.txt:21 (if):
if given arguments:

"SDL_FOUND" "AND" "VERSION_LESS" "2"

Unknown arguments specified

-- Configuring incomplete, errors occurred!
See also "/usr/local/share/Arcade-Learning-Environment/CMakeFiles/CMakeOutput.log".

Binary files for other games?

Until now, I have been using your code modified for gym.

However, I have noticed that the simulation speed depends critically on whether I use gym or the ALE from your repository. My GPU is a GTX TITAN X (Pascal), and the initial speed for each environment is as follows:

your ALE : about 1000 steps/sec
gym: about 500 steps/sec

I cannot figure out why this happens...

Anyway, for that reason I would like to build binary files for other ALE games (Breakout, or whatever); could you give me some references?

use Multiprocessing instead of Threading?

The current implementation uses Python threading. I wonder if it could be switched to multiprocessing?

Multiprocessing performs much better than threading on multi-core CPUs in Python.

Is this possible? Did I miss something?

Running speed

Hi @miyosuda!
Thank you for your great work!

I was implementing my own A3C-FF, and when I ran it, it seemed very slow in terms of steps per unit of time compared to your result. So I tried running the A3C-FF from your 'gym' branch with 36 threads on an AWS EC2 c4.8xlarge instance, but it still seemed very slow: running about 4.6 million global steps with 36 threads took about 11 hours.
I have no idea what's going on. How did you get the result that says it runs 472 steps per second with 8 threads on CPU?

Hope to hear from you soon.

My implementation: https://github.com/tatsuyaokubo/async-rl

about steps related to the reward

Hi Kosuke,
I've tried your model on the Breakout game. The performance was amazing: the average score went up to 520 after 80M steps, far better than any other model I've tried.
But the average score didn't go up much after 80M. Sometimes a game takes a huge number of steps: when only 1 or 2 bricks are left, the ball bounces between the paddle and the wall but never hits the remaining bricks.
Do you think it would be better to add the number of steps in a game as a penalty when calculating R, such as
R -= beta * sqrt(steps)
Thanks

BTW, I have made some changes:
1. Changed ACTION_SIZE = 4, because Breakout has 4 actions in ALE.
2. If a life is lost, treat it as terminal:

  #if not terminal_end:
  if lives == new_lives and not terminal_end:
    R = self.local_network.run_value(sess, self.game_state.s_t)
  else:
    #print("lives cost from %d to %d" % (lives, new_lives))
    lives = new_lives

Three questions about A3C-LSTM

  1. In the first layer of the AC-LSTM network, 'W_conv1' has shape [8,8,4,16]; what is the meaning of the two 8 dimensions? (A minimal sketch is given after this list.)

  2. What is the input data format, and what is the meaning of each dimension?
    I want to build a price-prediction model with A3C, but my input data is different from this project's.

  3. In the training process, how does it run multi-threaded, and could the order in which gradients are computed affect the updating of the weights and biases?
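
A hedged sketch of the convolution weight layout (illustrative initializer and placeholder, not the repository's exact code): TensorFlow's conv2d filter shape is [filter_height, filter_width, in_channels, out_channels], so [8, 8, 4, 16] means 8x8 spatial filters applied over the 4 stacked input frames, producing 16 feature maps.

  import tensorflow as tf

  x = tf.placeholder(tf.float32, [None, 84, 84, 4])                        # batch of 4-frame states
  W_conv1 = tf.Variable(tf.truncated_normal([8, 8, 4, 16], stddev=0.02))
  h_conv1 = tf.nn.relu(tf.nn.conv2d(x, W_conv1, strides=[1, 4, 4, 1], padding="VALID"))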

Thanks!

I can't train with the other games.

I want to train on other games, so I downloaded ROMs from an Atari ROM site and changed the ROM parameter in 'constants.py'.
But I got the error message below.

...
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
A.L.E: Arcade Learning Environment (version 0.5.1)
[Powered by Stella]
Use -help for help screen.
Game console created:
ROM file: ../roms/myrom/pinball.bin
Cart Name: Video Pinball (1980) (Atari) (PAL) [p1][!]
Cart MD5: a2424c1a0c783d7585d701b1c71b5fdc
Display Format: AUTO-DETECT ==> PAL
ROM Size: 4096
Bankswitch Type: AUTO-DETECT ==> 4K

Running ROM file...
Random seed is 0
Segmentation fault (core dumped)


I found the list of supported games in 'Arcade-Learning-Environment/src/games/supported'.

why recalculate pi and v?

Hello, in game_ac_network.py, in def prepare_loss(self, entropy_beta), you have:

  # temporary difference (R-V) (input for policy)
  self.td = tf.placeholder("float", [None])

  value_loss = 0.5 * tf.nn.l2_loss(self.r - self.v)

But td == self.r - self.v, right?

So why not use self.td directly instead of recalculating self.v? And for pi, why not pass it in as a placeholder as well?
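
A hedged sketch of how the training thread in this repository feeds these tensors (variable names here are illustrative): td is computed outside the graph and fed as a placeholder, so no policy-gradient signal flows back into the value head, whereas value_loss has to be built from self.v inside the graph so its gradient can update the value weights.

  # in the training thread, after collecting a rollout:
  td = R - V_estimate                          # computed in Python, per step
  sess.run(accum_gradients_op, feed_dict={
      self.local_network.s:  batch_states,
      self.local_network.a:  batch_actions,
      self.local_network.td: batch_td,         # placeholder consumed by the policy loss
      self.local_network.r:  batch_R})         # consumed by value_loss = 0.5 * l2_loss(r - v)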

Hoping for a reply, thanks.
