pemami4911 / deep-rl

297 stars, 13 watchers, 193 forks, 93 KB

Collection of Deep Reinforcement Learning algorithms

License: MIT License

Python 100.00%
openai-gym reinforcement-learning

deep-rl's Introduction

deep-rl

Collection of Deep Reinforcement Learning algorithms.

Dependencies:

Tested with Python 2.7 and Python 3.6

So far:

  1. DDPG - Deep Deterministic Policy Gradients, evaluated on the Pendulum-v0 environment in OpenAI Gym.
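
For reference, a quick way to check that the evaluation environment is available locally (a minimal sketch using OpenAI Gym; not part of the repository):

    import gym

    # Minimal sketch: confirm Pendulum-v0 (the environment DDPG is evaluated on) exists locally.
    env = gym.make('Pendulum-v0')
    print(env.observation_space)  # Box(3,) - cos(theta), sin(theta), theta_dot
    print(env.action_space)       # Box(1,) - torque in [-2, 2]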

Places where this code has been used

If you have used this code to do something cool, send me a link and a GIF (via email or pull request) and I'll add it here.

  1. @keithmgould used the same DDPG code to solve the inverted pendulum task in Roboschool. InvertedPendulum demo
  2. @janscholten Deep Reinforcement Learning with Feedback-based Exploration [code]

deep-rl's People

Contributors

afcruzs, pemami4911


deep-rl's Issues

image inputs

Do you have a modification of DDPG for image inputs, or a link to someone else's GitHub code that modifies yours?
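
There is no image-input variant in this repository; as a rough illustration only, one common adaptation is to swap the fully-connected state branch of the actor for a small convolutional encoder (the shapes and layer sizes below are made up):

    import tensorflow as tf
    import tflearn

    # Hypothetical sketch, not the repository's code: a convolutional actor for
    # stacked 84x84 grayscale frames, keeping the tanh/scaling output used by DDPG.
    def create_pixel_actor(action_dim, action_bound):
        inputs = tflearn.input_data(shape=[None, 84, 84, 4])
        net = tflearn.conv_2d(inputs, 32, 8, strides=4, activation='relu')
        net = tflearn.conv_2d(net, 64, 4, strides=2, activation='relu')
        net = tflearn.fully_connected(net, 256, activation='relu')
        w_init = tflearn.initializations.uniform(minval=-0.003, maxval=0.003)
        out = tflearn.fully_connected(net, action_dim, activation='tanh', weights_init=w_init)
        scaled_out = tf.multiply(out, action_bound)  # scale tanh output to the action range
        return inputs, out, scaled_out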

convergence issue

Hi, recently I used your template to learn some simple maneuvers.

But I find that the output always converges to -1 or +1 once the number of episodes is large enough, given output bounds of [-1, 1].

Have you ever run into this situation, or do you know how to solve it?

Best

Divide actor gradients by batch size?

Hello,

Shouldn't the actor gradients take into account that the final gradient will be an average of a batch? As a result, shouldn't the actor gradients be divided by the batch size? I believe tf.gradients just adds all of the partial derivatives for all the individual data points and does not take the mean.
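
For what it's worth, here is a tiny standalone check of that claim (purely illustrative, TF 1.x):

    import numpy as np
    import tensorflow as tf

    # tf.gradients sums the per-example partial derivatives rather than averaging them.
    x = tf.placeholder(tf.float32, [None, 1])
    w = tf.Variable([[1.0]])
    y = tf.matmul(x, w)            # dy_i/dw = x_i for each example
    grad = tf.gradients(y, w)[0]   # implicit grad_ys of ones => sum over the batch

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(grad, {x: np.array([[1.0], [2.0], [3.0]])}))  # [[6.]] = 1 + 2 + 3, not the mean 2.0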

Thanks for creating this tutorial!

Some questions!

Hi! I enjoy reading your blog post a lot! Thank you!!

  1. self.actor_gradients = tf.gradients(self.scaled_out, self.network_params, -self.action_gradient)
    is "self.action_gradient" getting multiplied with the original gradients of actor?
    (TF's documentation on this is hard to read)

  2. I notice that DDPG's actor update multiplies by the gradient of the critic w.r.t. the policy's chosen actions. I've seen some other actor-critic implementations where, instead of multiplying by the critic's gradient, they multiply the policy's gradients directly by the critic's output for the policy's chosen actions.

Do you think multiplying by the critic's gradient is unique to DDPG (since DDPG uses action sampling), or are these other implementations potentially wrong?
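
As a rough illustration of what the third argument to tf.gradients (grad_ys) does (not the repository's code; the numbers are made up):

    import tensorflow as tf

    # grad_ys is multiplied into the backprop, so tf.gradients(a, theta, -dQ/da)
    # yields -dQ/da * da/dtheta: a gradient-ascent direction on Q once it is
    # handed to a minimizing optimizer.
    theta = tf.Variable(2.0)
    a = 3.0 * theta                       # stand-in for the actor output, da/dtheta = 3
    upstream = tf.constant(-5.0)          # stand-in for -action_gradient from the critic

    g = tf.gradients(a, theta, upstream)[0]   # = upstream * da/dtheta = -15

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(g))                # -15.0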

There is an error... about Monitor

    AttributeError                            Traceback (most recent call last)
    in ()
         34
         35 if __name__ == '__main__':
    ---> 36     tf.app.run()

    /usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.pyc in run(main, argv)
         42   # Call the main function, passing through any arguments
         43   # to the final program.
    ---> 44   _sys.exit(main(_sys.argv[:1] + flags_passthrough))
         45
         46

    in main(_)
         26             env, MONITOR_DIR, video_callable=False, force=True)
         27     else:
    ---> 28         env = wrappers.Monitor(env, MONITOR_DIR, force=True)
         29
         30     train(sess, env, actor, critic)

    AttributeError: 'module' object has no attribute 'Monitor'

I am using Python 2.7 and TensorFlow 1.0.

What should I do? Could you please help me out?
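
For anyone hitting the same error: it usually means the installed gym release predates the wrappers.Monitor class, so upgrading gym and importing the wrapper explicitly tends to fix it. A hedged sketch, assuming a gym version that ships gym.wrappers.Monitor:

    import gym
    from gym import wrappers

    MONITOR_DIR = './monitor'  # placeholder; use whatever directory the script already passes in

    env = gym.make('Pendulum-v0')
    env = wrappers.Monitor(env, MONITOR_DIR, force=True)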

Possible regression to do with batch normalization

Hi,

I just tried your code for the first time and I was disappointed to see that even after 500+ episodes, the rewards for the Pendulum env were still below -1000. I poked around a little, and after reverting the latest commit (f242533) the algorithm works as expected, achieving good results after around 100 episodes. It seems like the commit above introduced a regression.

why is DDPG so unstable?

I can train a good agent, but the learning curve is quite noisy. Why? Is it an implementation issue or something intrinsic to DDPG?

Actor network output increases to 1, TORCS, TF 1.0.0

Hi,

Thanks for your code.

I tried to use it for training on TORCS; however, my results are not good, and specifically, after a few steps the actions generated by the actor network increase to 1 and stay there. Similar to the following (the top 10 rows, for example):

[[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]]

Gradients for that set:
[[ 4.80426752e-05 1.51122265e-04 -1.96302353e-05]
[ 4.80426752e-05 1.51122265e-04 -1.96302353e-05]
[ 4.80426752e-05 1.51122265e-04 -1.96302353e-05]
[ 4.80426752e-05 1.51122265e-04 -1.96302353e-05]
[ 4.80426752e-05 1.51122265e-04 -1.96302353e-05]
[ 4.80426752e-05 1.51122265e-04 -1.96302353e-05]
[ 4.80426752e-05 1.51122265e-04 -1.96302353e-05]
[ 4.80426752e-05 1.51122265e-04 -1.96302353e-05]
[ 4.80426752e-05 1.51122265e-04 -1.96302353e-05]
[ 4.80426752e-05 1.51122265e-04 -1.96302353e-05]]

I suspect the problem is somewhere around the following line:

    # Combine the gradients here
    self.actor_gradients = tf.gradients(self.scaled_out, self.network_params, -self.action_gradient)

Could you tell me what you think the problem is?

I am using the TF 1.0.0 CPU version.

Thanks

A problem about the DDPG

Hi, I want to implement the DDPG algorithm, and before that I've read your code. It's very useful. I still have some small questions about the code.

1. As in DDPG.py, lines 61 to 66:

    # This gradient will be provided by the critic network
    self.action_gradient = tf.placeholder(tf.float32, [None, self.a_dim])
    # Combine the gradients here
    self.actor_gradients = tf.gradients(
        self.scaled_out, self.network_params, -self.action_gradient)

My question is: tf.gradients(ys, xs) sums up dy/dx over each y in ys, while in the paper the equation for dJ/d(theta) divides by N. I wonder whether I should write the code like this:

    self.actor_gradients = tf.div(tf.gradients(
        self.scaled_out, self.network_params, -self.action_gradient), N)

Looking forward to your reply. Thank you very much.
Contact: [email protected]
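
A self-contained sketch of the normalization being asked about (illustrative only; the tiny one-layer "actor" and the values of s_dim, a_dim and N are made up, not the repository's code):

    import tensorflow as tf

    s_dim, a_dim, N = 3, 1, 64

    states = tf.placeholder(tf.float32, [None, s_dim])
    action_gradient = tf.placeholder(tf.float32, [None, a_dim])  # dQ/da from the critic

    W = tf.Variable(tf.random_normal([s_dim, a_dim]))
    scaled_out = tf.tanh(tf.matmul(states, W))                   # stand-in for the actor output
    network_params = [W]

    # tf.gradients sums over the batch, so divide each gradient by N to get the batch average.
    unnormalized = tf.gradients(scaled_out, network_params, -action_gradient)
    actor_gradients = [tf.div(g, N) for g in unnormalized]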

batch normalization not actually enabled?

Hi, this repo has been very helpful to me as I'm learning DDPG myself. As an exercise to make sure I understood what's going on, I re-implemented a similar DDPG setup using Keras, and in the process I noticed something -- I don't think your batch_normalization layers are ever actually learning (adjusting their weights), so they are essentially no-ops except for the small epsilon value. It looks like with tflearn you need to set is_training to true during training steps: http://tflearn.org/config/#is_training
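
A minimal sketch of the toggle being referred to (assuming tflearn's graph-level training flag, per the link above; not the repository's code):

    import tensorflow as tf
    import tflearn

    sess = tf.Session()
    # ... build the actor/critic networks containing tflearn.batch_normalization(...) here ...
    sess.run(tf.global_variables_initializer())

    tflearn.is_training(True, session=sess)   # enable BN statistics updates for the training step
    # sess.run(train_ops, feed_dict=...)      # the DDPG update from the training loop
    tflearn.is_training(False, session=sess)  # switch back for action selection / evaluation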

Interestingly, with my Keras implementation I get very similar performance to yours when I disable my batch normalization layers. When I enable my batch norm layers, performance is actually much worse and the agent often doesn't solve Pendulum-v0 even after hundreds of episodes.

I found a couple of discussions around the web where other people describe the difficulties they've had getting batch normalization to work well with DDPG, in spite of what the original paper says. For example, this reddit post. It all makes me very curious.

Anyway, sorry, this is all mostly just for my own benefit as I'm learning, but I thought you'd like to know. Thanks again for sharing your code!

In tf.gradients, why is -self.action_gradient needed?

Hi, in the code of ActorNetwork

        self.unnormalized_actor_gradients = tf.gradients(
            self.scaled_out, self.network_params, -self.action_gradient)

Why is -self.action_gradient needed here? grad_ys is -self.action_gradient, but you return self.unnormalized_actor_gradients.

L2 weight decay for Q

The paper mentions "For Q we included L2 weight decay of 0.01 and used a discount factor of gamma = 0.99"

Does that mean that we need to add L2 regularisation to each layer in the critic network?

Maybe something like this in create_critic_network:

    net = tflearn.add_weights_regularizer(net, 'L2', weight_decay=0.01)
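
Another option, sketched under the assumption that the critic is built from tflearn.fully_connected layers (the 400/300 sizes follow the paper, not necessarily this repository's exact code):

    import tensorflow as tf
    import tflearn

    # Illustrative only: tflearn's fully_connected accepts a per-layer regularizer,
    # which is one way to get the paper's L2 weight decay of 0.01 on the critic.
    def create_critic_network(s_dim, a_dim):
        inputs = tflearn.input_data(shape=[None, s_dim])
        action = tflearn.input_data(shape=[None, a_dim])
        net = tflearn.fully_connected(inputs, 400, activation='relu',
                                      regularizer='L2', weight_decay=0.01)
        net = tflearn.fully_connected(tf.concat([net, action], 1), 300, activation='relu',
                                      regularizer='L2', weight_decay=0.01)
        q_out = tflearn.fully_connected(net, 1, regularizer='L2', weight_decay=0.01)
        return inputs, action, q_out

Note that with a hand-written loss and optimizer (as this code uses), the regularization terms collected this way would still need to be added to the critic's loss explicitly.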

Using DDPG for Pong

Hi Patrick,
I'm trying to convert your DDPG pendulum code to solve Pong. I made minimal changes, like input pre-processing and modifying the input and output dimensions. Over a few iterations I notice that the paddle sticks to the bottom of the screen, as the probability of the UP action becomes negligible and DOWN becomes almost equal to one. Since the original sample is for a continuous-action problem and I'm expecting discrete output (up or down), I'm guessing I missed changing some part of the code. Could you kindly look at my code here and point me to the missing piece:
https://gist.github.com/option-greek/dfc9288d5811371f578b2f52dce29f0e

Thanks,
OG
