Comments (5)
Good catch. For lab 08, we are experimenting with several reward functions.
Also, do you have any comments on:
random_noise = np.random.uniform(0, 1, output_size)
action = np.argmax(action_prob + random_noise)
I suspect the correct implementation is instead:
action = np.argmax(np.random.multinomial(n=1, pvals=action_prob, size=1)[0])
I really need help on PG (policy gradients). Please give us more comments. Thanks in advance.
from reinforcementzerotoall.
If it's a policy gradient, the agent should follow the given policy distribution; it shouldn't just take the argmax, at least while training.
In the policy gradient agent for the CartPole case, a single action should be chosen as follows:
actions = [0, 1] # suppose there are two discrete actions
action_prob = [0.7, 0.3] # distribution given from policy network
action = np.random.choice(actions, size=1, p=action_prob)
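A quick way to sanity-check this is a minimal self-contained sketch (not from the original code; the fixed seed and sample count are my choices): sample many actions and compare the empirical frequencies to action_prob.

```python
import numpy as np

np.random.seed(0)  # fixed seed for reproducibility (my choice)
actions = [0, 1]
action_prob = [0.7, 0.3]

# Draw many actions from the policy distribution
samples = np.random.choice(actions, size=10_000, p=action_prob)
freq = np.bincount(samples) / len(samples)
print(freq)  # empirical frequencies, close to [0.7, 0.3]
```

An argmax over action_prob plus uniform noise would not reproduce these frequencies, which is the original concern.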
I haven't actually run the file yet, but for problems like CartPole it's often the case that derivative-free or simpler methods outperform policy gradient methods, so I wouldn't be surprised if it's actually doing worse. I still have to check whether the other implementations are correct. Will let you know!
Today I tested the above code.
It turns out there was a problem with the numpy dtype in the code above: the default was numpy.int, so the discounted values were truncated.
The correct implementation of discount rewards should be:
def discount_correct_rewards(r, gamma=0.99):
    """ take 1D float array of rewards and compute discounted reward """
    discounted_r = np.zeros_like(r, dtype=np.float32)
    running_add = 0
    for t in reversed(range(len(r))):
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    # discounted_r -= discounted_r.mean()
    # discounted_r /- discounted_r.std()
    return discounted_r
It works well.
Why does the original implementation still work? It's due to the normalization factor, which has a similar effect. That's why people love normalization, I guess lol.
However, it should work without the normalization too: the correct implementation will always work, with or without the normalization.
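A minimal sketch of why the explicit dtype matters (the concrete numbers are just an illustration; CartPole gives an integer reward of 1 per step):

```python
import numpy as np

r = np.array([1, 1, 1])                    # integer rewards, as in CartPole
bad = np.zeros_like(r)                     # inherits the integer dtype
good = np.zeros_like(r, dtype=np.float32)  # the fix: force float32

bad[0] = 0.99 * 1.99 + 1                   # 2.9701 gets truncated to 2
good[0] = 0.99 * 1.99 + 1                  # stays 2.9701
print(bad[0], good[0])
```

With the integer buffer, every discounted return collapses to its truncated integer part, which distorts the gradient scaling.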
Suggestion
- Update discount_rewards to the above implementation.
Please feel free to fix/send PR.
In addition, could you also fix the 200-step cap for CartPole in the QN and previous examples?
Thanks in advance!
Just noticed a fatal typo:
discounted_r /- discounted_r.std()
Please update for future readers:
discounted_r /= discounted_r.std()
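With the operator fixed, the optional normalization block would look like this (a sketch; the eps guard against a zero standard deviation is my addition, not part of the original):

```python
import numpy as np

def normalize_returns(discounted_r, eps=1e-8):
    # Standardize to zero mean and roughly unit std;
    # eps (my addition) avoids division by zero for constant rewards
    discounted_r = discounted_r - discounted_r.mean()
    discounted_r = discounted_r / (discounted_r.std() + eps)
    return discounted_r

out = normalize_returns(np.array([2.9701, 1.99, 1.0], dtype=np.float32))
print(out.mean(), out.std())  # ~0.0 and ~1.0
```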