Comments (8)
OK. On my side I'll launch a training run on Breakout as a little sanity check.
I think it's easier there to see whether the agent has really learned something or is just playing randomly - in Space Invaders it's pretty easy to be convinced that a fully random agent is playing quite well ^^
Concerning sampling via Q-values versus just drawing new weights for the Noisy layers, I really don't know; maybe we should try both and compare (Q-value sampling may lead to far too much exploration, but on the other hand, late in training the agent may have learned to ignore all the incoming noise from the Noisy layers...).
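As a rough illustration of the two options, here's a minimal PyTorch sketch - it assumes a `model` whose forward pass returns Q-values and whose Noisy layers expose a `reset_noise()` method (names are illustrative, not necessarily this repo's exact API):

```python
import torch
import torch.nn.functional as F

def act_softmax(model, state, temperature=1.0):
    # Option 1: sample the action from a softmax over the Q-values.
    # A high temperature gives lots of exploration; a low one is near-greedy.
    with torch.no_grad():
        q = model(state.unsqueeze(0)).squeeze(0)  # [num_actions]
    probs = F.softmax(q / temperature, dim=0)
    return torch.multinomial(probs, 1).item()

def act_noisy_greedy(model, state):
    # Option 2: redraw the noise in the NoisyLinear layers, then act
    # greedily w.r.t. the perturbed Q-values. If the agent has learned to
    # ignore the noise, this collapses back to the deterministic policy.
    model.reset_noise()
    with torch.no_grad():
        return model(state.unsqueeze(0)).argmax(1).item()
```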
from rainbow.
Looks reasonable so far. The Q-values increased rapidly, and have now stabilised (looking very similar to the values of the Double DQN).
The reward itself is clearly increasing (noisily, but at a reasonable level - not one at which I'd say there's definitely a problem). It's pretty much at the level of a trained Double DQN at about 1/3 of the training steps - but of course according to the Rainbow paper the score only really takes off after the halfway mark (and even then many runs may not work out so well, so even if this fails after the full run it's unfortunately not conclusive).
from rainbow.
I set off a run on Space Invaders last night - it's one where Rainbow is clearly better than the alternatives, but it'll take a few days to get to the point where I can tell whether that's the case here. Of the previous runs I've made, making sure that transitions next to the buffer's write index aren't sampled seemed like an important fix, but I've never run anything for this long. You can have a look at the training curves in the paper to see if any other game might be useful to look at.
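For reference, here's a minimal sketch of that kind of boundary check on a circular buffer - the parameter names (`write_index`, `history`, `n`) are illustrative, and the actual memory samples through a prioritised sum tree rather than uniformly:

```python
import numpy as np

def valid_index(idx, write_index, capacity, history=4, n=3):
    behind = (write_index - idx) % capacity  # steps the sample sits behind the write head
    ahead = (idx - write_index) % capacity   # steps it sits ahead of the write head
    # The n-step successors of idx must already exist (behind > n), and the
    # frame-stack history of idx must not be overwritten (ahead >= history).
    return behind > n and ahead >= history

def sample_indices(batch_size, write_index, capacity, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    indices = []
    while len(indices) < batch_size:  # rejection-sample until valid
        idx = int(rng.integers(capacity))
        if valid_index(idx, write_index, capacity):
            indices.append(idx)
    return indices
```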
Non-deterministic evaluation does sound good, but I'm wondering why the random no-ops in the environment wouldn't provide a "stochastic" environment - it could well be that they're just not providing enough stochasticity. Also, I'm not sure whether sampling via Q-values or simply taking a new draw of weights via the NoisyLinear layers is the better way to go.
from rainbow.
I think it's difficult for a random agent to do really well at Space Invaders. In any case, I plot Q-values on a held-out validation memory, and that's somewhat informative as to learning. Let me know how sampling via Q-values goes - I've had a skim through the DM papers and they seem to average results over many testing episodes, but I don't see anything specific about NoisyNet evaluation; without it you'd just take a uniformly random action with a very low probability.
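The held-out validation idea is roughly the following - a minimal sketch assuming a `model` that returns Q-values and a fixed batch of `val_states` collected early in training (both names are placeholders):

```python
import torch

@torch.no_grad()
def validation_q(model, val_states):
    # Track the mean greedy Q-value over a fixed set of held-out states.
    q = model(val_states)                 # [num_states, num_actions]
    return q.max(dim=1)[0].mean().item()  # mean of the per-state max-Q
```

A value that rises and then stabilises (as in the run above) is a reasonable sign of learning; runaway growth usually points to a bug or divergence.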
from rainbow.
So is your training on Space Invaders doing better than just a random agent now? ^^
The run I launched last night on Breakout didn't manage to learn anything (but I think I may have had an error in the reset function for Breakout).
I'm now launching a run on Pong to sanity-check whether my agent can learn anything at all.
from rainbow.
Hmm, OK, that looks really good to me and definitely seems to be working! (Did you add a non-deterministic test, e.g. by drawing new weights in the Noisy layers?)
I only did 5M steps on Breakout; maybe that wasn't enough to see any progress at all (or maybe I just have some bugs - I'll look into it further next week).
from rainbow.
I ran this as soon as I'd got the last few fixes in, so testing is completely deterministic.
If DM followed their previous evaluation protocol, then we should actually use an ε-greedy policy with ε = 0.001 (the quote below is from the Double DQN paper on DQN evaluation, but later on they mention using a lower ε):
The learned policies are evaluated for 5 mins of emulator time (18,000 frames) with an ε-greedy policy where ε = 0.05. The scores are averaged over 100 episodes.
So if you're able to do quick tests (perhaps on Pong) of evaluation, the first thing is to see whether using 100 instead of 10 evaluation episodes introduces some variance. Otherwise, given that the network is trained to maximise reward even with the noisy layers, taking different draws of weights seems like a better (albeit non-backwards-compatible) way of evaluating it.
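For concreteness, a sketch of that evaluation protocol - ε-greedy with a small ε, episodes capped at 18,000 frames (5 minutes of emulator time), score averaged over the evaluation episodes. `env` (whose `step()` is assumed to return `(state, reward, done)`) and `greedy_action` are hypothetical stand-ins, not this repo's API:

```python
import random

def evaluate(env, greedy_action, num_actions, episodes=100,
             epsilon=0.001, max_frames=18000):
    total = 0.0
    for _ in range(episodes):
        state, done, frames, score = env.reset(), False, 0, 0.0
        while not done and frames < max_frames:
            if random.random() < epsilon:
                action = random.randrange(num_actions)  # rare uniform action
            else:
                action = greedy_action(state)           # argmax-Q action
            state, reward, done = env.step(action)
            score += reward
            frames += 1
        total += score
    return total / episodes  # mean score over evaluation episodes
```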
from rainbow.
Closing this issue as injecting even a small amount of noise via ε-greedy gives a sufficient distribution over test performance, and it is (AFAIK) DM's standard method of evaluating DQN variants.
from rainbow.