keon / policy-gradient Goto Github PK


Minimal Monte Carlo Policy Gradient (REINFORCE) Algorithm Implementation in Keras

License: MIT License

Python 100.00%

policy-gradient deep-reinforcement-learning keras reinforcement-learning

policy-gradient's Introduction

Policy Gradient

Minimal implementation of Stochastic Policy Gradient Algorithm in Keras

Pong Agent

This PG agent appears to start winning more frequently after about 8,000 episodes. Below is the score graph.

policy-gradient's Issues

Minor Questions

1. Are the weights in the repo trained? I ask because they did not work that well.
2. Do you find this approach better than a simple feed-forward approach? (http://karpathy.github.io/2016/05/31/rl/)
3. How did you narrow down the architecture?
   a. Why only one conv layer?
   b. Why no deconv layers?

Thanks.

Train agent process error

I think there is an error in the agent training code, at

policy-gradient/pg.py

Line 56 in b83f050

running_add = running_add * self.gamma + rewards[t]

The calculation does not produce the correct result. For comparison, see: https://github.com/gabrielgarza/openai-gym-policy-gradient/blob/master/policy_gradient_layers.py#
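The issue does not say what exactly goes wrong, so the following is only a reference sketch of the usual backward recurrence for discounted returns, G_t = r_t + gamma * G_{t+1}, not the repository's code. The function name, the `reset_on_nonzero` flag (a Pong-style convention where a nonzero reward marks the end of a rally), and the explicit float conversion are assumptions added for illustration.

```python
import numpy as np

def discount_rewards(rewards, gamma=0.99, reset_on_nonzero=False):
    """Return the discounted return G_t = r_t + gamma * G_{t+1} for every step."""
    # Work in float64 so that integer reward arrays are not truncated.
    rewards = np.asarray(rewards, dtype=np.float64)
    discounted = np.zeros_like(rewards)
    running_add = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(rewards.size)):
        if reset_on_nonzero and rewards[t] != 0:
            # Hypothetical Pong-style boundary: a point was scored, start a new sum.
            running_add = 0.0
        running_add = running_add * gamma + rewards[t]
        discounted[t] = running_add
    return discounted
```

For example, `discount_rewards([0, 0, 1], gamma=0.5)` gives `[0.25, 0.5, 1.0]`. Comparing the output of a sketch like this against the output of pg.py on the same reward sequence is one way to pin down whether line 56 itself is at fault.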

Incorrect normalising of discounted rewards

Hey mate,

Great work, but I think your normalization of the discounted rewards is wrong. The rewards are scaled by the standard deviation but never centered on the mean.
pg.py, line 64: rewards = rewards / np.std(rewards - np.mean(rewards))

should maybe be:

rewards = (rewards - np.mean(rewards)) / np.std(rewards - np.mean(rewards))
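A minimal sketch of the proposed fix, assuming NumPy arrays of discounted rewards. The function name and the small `eps` guard against a zero standard deviation are additions for illustration and are not in pg.py. Note that `np.std(rewards - np.mean(rewards))` equals `np.std(rewards)`, since subtracting a constant does not change the spread, so the denominator can be simplified.

```python
import numpy as np

def normalize_rewards(rewards, eps=1e-8):
    """Standardize rewards to roughly zero mean and unit standard deviation."""
    rewards = np.asarray(rewards, dtype=np.float64)
    # Center first: this is the step missing from the line quoted in the issue.
    centered = rewards - np.mean(rewards)
    # eps (an assumption) avoids division by zero when all rewards are equal.
    return centered / (np.std(rewards) + eps)
```

With the original line, the result keeps a nonzero mean, so every action in an episode with mostly positive returns is pushed in the same direction. Centering makes below-average steps negative and above-average steps positive.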

Why normalize predicted probabilities?

prob = aprob / np.sum(aprob)
https://github.com/keon/policy-gradient/blob/master/pg.py#L46

I am not sure if this line is really required, as they would be already normalized due to softmax. Please let me know in case I am missing something.
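One possible reason for the extra division, offered here as an assumption and not confirmed by the author: a softmax computed in float32 can sum to a value that differs from 1 by a small rounding error, and NumPy's `choice` raises `ValueError: probabilities do not sum to 1` when the sum is outside its tolerance. A hedged sketch of the defensive pattern, with a hypothetical helper name:

```python
import numpy as np

def sample_action(aprob, rng=None):
    """Sample an action index from (possibly float32) softmax probabilities."""
    rng = np.random.default_rng() if rng is None else rng
    # Promote to float64 and renormalize so the probabilities sum to 1
    # within the tolerance that NumPy's choice() checks.
    prob = np.asarray(aprob, dtype=np.float64)
    prob = prob / prob.sum()
    return int(rng.choice(prob.size, p=prob))
```

If the network output is already float64 and sums to 1 within tolerance, the division is a harmless no-op, which would match the intuition in the question above.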

Loss function/Labels for neural network used?

I understand backpropagation in policy gradient networks, but I am not sure how your code works with Keras's automatic differentiation.

That is, how do you transform it into a supervised learning problem?
For example, the code below:

Y = self.probs + self.learning_rate * np.squeeze(np.vstack([gradients]))

Why is Y not a one-hot vector for the action taken?
You compute the gradient assuming the action taken is correct, i.e. that Y is a one-hot vector. Then you multiply it by the reward at the corresponding time step. But during training you feed it in as a correction to the predicted probabilities.
I think one could multiply the rewards by the one-hot vector instead, and then feed that in directly.

If possible please clarify my doubt. :)
https://github.com/keon/policy-gradient/blob/master/pg.py#L67
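A small NumPy check may help with this question. It assumes, based on the description above and not on a reading of pg.py, that `gradients` holds (one-hot action minus predicted probabilities) times the discounted reward, and that the network is a softmax trained with categorical cross-entropy. Under those assumptions the entries of Y still sum to 1, so the cross-entropy gradient with respect to the logits is `probs - Y`, which equals `-learning_rate * reward * (onehot - probs)`. Descending on that loss is therefore a step of ascent on `reward * log(prob[action])`, the REINFORCE objective. All numbers below are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

def cross_entropy(z, y):
    # Categorical cross-entropy of softmax(z) against a fixed target y.
    return -np.sum(y * np.log(softmax(z)))

def numerical_grad(f, z, h=1e-6):
    # Central finite differences, one logit at a time.
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = h
        grad[i] = (f(z + step) - f(z - step)) / (2 * h)
    return grad

logits = np.array([0.2, -0.5, 1.0])   # hypothetical network output before softmax
probs = softmax(logits)
action, reward, lr = 2, 1.5, 0.01     # hypothetical sampled action, return, step size

onehot = np.eye(probs.size)[action]
gradient = (onehot - probs) * reward  # assumed form of `gradients` in the issue
target = probs + lr * gradient        # the soft label Y, held constant during the fit

ce_grad = numerical_grad(lambda z: cross_entropy(z, target), logits)
reinforce_grad = reward * (onehot - probs)  # d/dz of reward * log softmax(z)[action]
print(np.allclose(ce_grad, -lr * reinforce_grad, atol=1e-7))
```

Feeding the reward-weighted one-hot vector as the label, as suggested above, gives a loss of `-reward * log(prob[action])` directly, which has the same gradient direction without the extra `learning_rate` factor.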
