stevenpjg / ddpg-aigym

Continuous control with deep reinforcement learning - Deep Deterministic Policy Gradient (DDPG) algorithm implemented in OpenAI Gym environments

License: MIT License

Python 100.00%

deep-learning reinforcement-learning tensorflow

ddpg-aigym's Introduction

ddpg-aigym

Deep Deterministic Policy Gradient

Implementation of the Deep Deterministic Policy Gradient algorithm (Lillicrap et al., arXiv:1509.02971) in TensorFlow

How to use

git clone https://github.com/stevenpjg/ddpg-aigym.git
cd ddpg-aigym
python main.py

During training

Once trained

Learning Curve

Learning curve for the InvertedPendulum-v1 environment.

Dependencies

TensorFlow (developed with TensorFlow version 0.11.0rc0 [CPU version] [GPU version])
OpenAI Gym
MuJoCo

Features

Batch Normalization (improvement in learning speed)
Grad-inverter (described in arXiv:1511.04143)
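
The grad-inverter idea from arXiv:1511.04143 can be sketched as follows. This is a minimal pure-Python illustration of the technique as described in that paper, not the code used in this repository; the function name `invert_gradients` and its argument layout are assumptions made for the example. The gradient of Q with respect to each action dimension is scaled by how much room is left before the action hits its bound, so updates shrink near the limits and change sign once a bound is exceeded.

```python
def invert_gradients(grads, actions, low, high):
    """Scale dQ/da per action dimension by the remaining room to the bound.

    grads, actions, low, high are equal-length sequences of floats
    (hypothetical layout chosen for this sketch).
    """
    inverted = []
    for g, a, lo, hi in zip(grads, actions, low, high):
        width = hi - lo
        if g > 0:
            # Gradient suggests increasing the action: scale by headroom above.
            scale = (hi - a) / width
        else:
            # Gradient suggests decreasing the action: scale by headroom below.
            scale = (a - lo) / width
        inverted.append(g * scale)
    return inverted
```

For an action of 0.5 bounded in [-1, 1], a gradient of 1.0 is scaled to 0.25, while a gradient of -1.0 is scaled to -0.75. The split on the sign of the gradient is why the two cases are computed differently.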

Note

To use a different environment

experiment = 'InvertedPendulum-v1'  # specify the environment here

To use batch normalization

is_batch_norm = True  # batch normalization switch

Let me know if you run into any issues or need clarification regarding hyperparameter tuning.

ddpg-aigym's Issues

What does the grad_inverter mean?

Why filter by zero, and why calculate the gradients in different ways?

How to visualize the result with "episode_reward"

Hi Steven, my system does not have MuJoCo, so I combined your code with nrod80's code (https://github.com/nrod80/ddpg-for-openai) to build a new version. But I could not visualize the result. Could you tell me where the visualization API is?
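
One possible way to look at per-episode rewards is to smooth them before plotting. The sketch below assumes the total reward of each episode has already been collected in a Python list; the helper name `moving_average` and the `window` parameter are hypothetical and not part of this repository.

```python
def moving_average(rewards, window=10):
    """Return the trailing mean over up to `window` most recent episodes."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, r in enumerate(rewards):
        running += r
        if i >= window:
            # Drop the episode that just fell out of the window.
            running -= rewards[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed
```

The returned list has one value per episode, so it can be passed directly to a plotting library (for example `matplotlib.pyplot.plot`, if matplotlib is installed) to draw a learning curve.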

Error when running in batch norm mode

Hi,

When I run the code in batch-norm mode for InvertedPendulum-v1 I get the following error.
Any idea why this is happening?

Thanks

Can't import gym: "no module named gym". The problem is that I have installed gym. Why?

Number of nodes in a hidden layer is different from the DDPG paper

https://github.com/stevenpjg/ddpg-aigym/blob/master/critic_net.py#L58
According to the paper this should be 300.

Error when running on other environments such as Reacher-v1

I tried to run this code for Reacher-v1 and Swimmer-v1, but it threw an error due to this line.
ValueError: total size of new array must be unchanged

Could you please also explain why you need this step for InvertedPendulum?

A question on action_gradients in critic_net_bn.py

Hi,

I just read through your DDPG implementation, and it looks awesome. Thanks for sharing!

I am confused by the code below
self.action_gradients = [self.act_grad_v[0]/tf.to_float(tf.shape(self.act_grad_v[0])[0])]
in critic_net_bn.py.

Why do we add [0] after self.act_grad_v, given that we use a batch of actions to compute gradients?
What is "[0]" used for?

Thank you so much!

Error spotted

This line of code looks wrong. (https://github.com/stevenpjg/ddpg-aigym/blob/master/critic_net.py#L84)
The critic model should be predicting here, not the actor model.

Need help understanding how grad-inv accelerates the learning process

I hope I am not troubling you too much by asking questions.

Could you please help me understand the idea behind the recent changes made to accelerate learning?
BTW, is it converging on Reacher-v1?
Could you please also mention the time taken to learn and your system configuration?
Also, look at this paper for reward scaling; it could be a reason for divergence in case it is not converging.

A question about the running speed of your code

Hello! I have run your code and found a problem. It seems that the update step where tf.assign is used becomes slower as the code keeps running, and it becomes the bottleneck for running speed. Have you come across the same problem? If so, I am looking forward to a solution. Thanks a lot!

Running the code on the "Reacher" task

Hi, Steven! Recently I downloaded your code and tested it on the "Reacher" task. However, I found that with GPU-based TensorFlow it could run 200 episodes per day, which seems a bit slow. Is there anything I need to adjust to speed up the process? (I found that GPU usage is low, around 3%~10%, so maybe the GPU is not being used sufficiently.) Also, you said that we could use one more wrapper to scale the reward; can you explain that more specifically? Thanks a lot!

Error with GLEW initialization

This is the output that I got:
Creating window glfw
ERROR: GLEW initalization error: Missing GL version
My setup:
Python3.5, Ubuntu 16.04, gym from openai official github.

It is very slow on Pendulum-v0 (classic control environment)

I ran this code on the Pendulum-v0 environment, and it is far too slow on this particular environment. But it is considerably faster on InvertedPendulum-v1. Do you have any idea why?

Need help understanding a step

Could you please explain the 3 in this line: https://github.com/stevenpjg/ddpg-aigym/blob/master/actor_net.py#L62 ?

Question on the loss function for critic network training

Hello,

I just read through your DDPG implementation, and it looks awesome :) I have a question: what does the Q loss curve look like over training time when you train Inverted Pendulum with DDPG? I also implemented DDPG myself, and I noticed that Inverted Pendulum did learn something, but the Q loss diverged. I wonder if you have the same issue with your implementation.
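
For reference, the "Q loss" in this question presumably refers to the critic loss from the DDPG paper (Lillicrap et al., arXiv:1509.02971), written here in that paper's notation for a minibatch of N transitions. Whether this repository computes exactly this quantity is an assumption.

```latex
y_i = r_i + \gamma \, Q'\!\left(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right)
\qquad
L = \frac{1}{N} \sum_{i} \left( y_i - Q(s_i, a_i \mid \theta^{Q}) \right)^2
```

Because the targets y_i are produced by the target networks Q' and μ', which themselves change during training, this loss is measured against a moving target and need not decrease steadily even when the policy improves.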