Comments (7)
Hi, you need to make sure your model implements the chainerrl.recurrent.Recurrent interface so that it can be treated as a recurrent model. I guess the easiest way to do that is to inherit chainerrl.recurrent.RecurrentChainMixin, like

class QFunctionRecurrent(chainer.Chain, StateQFunction, RecurrentChainMixin):

which will find L.LSTM links by searching recursively through chainer.Chain and chainer.ChainList.
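For illustration, here is a minimal sketch of that pattern (the layer sizes and the class name are made up for this example; a real Q-function would also inherit StateQFunction and wrap its output in DiscreteActionValue, as in the code later in this thread):

import chainer
import chainer.functions as F
import chainer.links as L
from chainerrl.recurrent import RecurrentChainMixin

class MinimalRecurrentModel(chainer.Chain, RecurrentChainMixin):
    """Sketch only: RecurrentChainMixin supplies the Recurrent interface
    (e.g. reset_state()) by locating stateful links such as L.LSTM
    inside the chain."""

    def __init__(self, obs_size=4, n_hidden=64, n_out=2):
        super().__init__(
            fc=L.Linear(obs_size, n_hidden),
            lstm=L.LSTM(n_hidden, n_hidden),
            out=L.Linear(n_hidden, n_out),
        )

    def __call__(self, x):
        h = F.relu(self.fc(x))
        h = self.lstm(h)  # stateful: carries hidden state across calls
        return self.out(h)

model = MinimalRecurrentModel()
model.reset_state()  # provided via the mixin; clears the LSTM state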
Documentation on the usage of recurrent models is mostly missing, so I opened another issue for it: #83. Thanks for reporting the issue!
Thanks for your code!
Just for clarification: if episodic_replay=True, then minibatch_size corresponds to the number of episodes used for experience replay, and episodic_update_len corresponds to the number of time steps within each of those episodes, right?
Thus, if one episode has e.g. 50 time steps and episodic_update_len=16, will it draw 16 consecutive time steps from this episode for replay? Furthermore, if episodic_update_len=None, will it use all time steps within the episode?
You are correct. minibatch_size is the number of episodes to sample for an update. Each sampled episode's length is at most episodic_update_len.
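For concreteness, here is a sketch of what that sampling looks like with the EpisodicReplayBuffer (the capacity and the counts are placeholder values for the example):

from chainerrl import replay_buffer

# Stores whole episodes; sample_episodes(n_episodes, max_len) draws
# n_episodes of them, truncating each to at most max_len consecutive steps.
rbuf = replay_buffer.EpisodicReplayBuffer(capacity=10 ** 5)

# ... fill it via rbuf.append(...) per step and
# rbuf.stop_current_episode() at episode ends ...

episodes = rbuf.sample_episodes(n_episodes=4, max_len=16)
# len(episodes) == 4; each element holds at most 16 transitions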
As for average_loss, it turned out to be a bug in ChainerRL. Losses are computed and the model is updated as usual; however, the value of average_loss is not updated at all when episodic_update=True. I'll open an issue for it and fix it soon. Thanks for reporting it!
Hi, thank you very much for your fast response! That indeed solved the issue.
Thanks again!
Unfortunately, my above statement about the issue being solved was a bit hasty.
I modified the Q-function as follows:
class QFunction(chainer.Chain, StateQFunction, RecurrentChainMixin):

    def __init__(self, n_input_channels=3, n_actions=4, bias=0.1):
        self.n_actions = n_actions
        self.n_input_channels = n_input_channels
        conv_layers = chainer.ChainList(
            L.Convolution2D(n_input_channels, 32, 8, stride=4, bias=bias),
            L.Convolution2D(32, 64, 4, stride=2, bias=bias),
            L.Convolution2D(64, 64, 3, stride=1, bias=bias),
            L.Convolution2D(64, 128, 7, stride=1, bias=bias))
        lstm_layer = L.LSTM(128, 128)
        a_stream = MLP(128, n_actions, [2])
        v_stream = MLP(128, 1, [2])
        super().__init__(conv_layers=conv_layers, lstm_layer=lstm_layer,
                         a_stream=a_stream, v_stream=v_stream)

    def __call__(self, x, test=False):
        """
        Args:
            x (ndarray or chainer.Variable): An observation
            test (bool): a flag indicating whether it is in test mode
        """
        h = x
        for l in self.conv_layers:
            h = F.relu(l(h))
        h = self.lstm_layer(h)
        batch_size = x.shape[0]
        ya = self.a_stream(h, test=test)
        mean = F.reshape(F.sum(ya, axis=1) / self.n_actions, (batch_size, 1))
        ya, mean = F.broadcast(ya, mean)
        ya -= mean
        ys = self.v_stream(h, test=test)
        ya, ys = F.broadcast(ya, ys)
        q = ya + ys
        return chainerrl.action_value.DiscreteActionValue(q)
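For reference, the advantage/value combination in __call__ above is the standard dueling aggregation,

$$Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'),$$

where a_stream produces the advantages A, v_stream produces the state value V, and the F.sum / F.broadcast lines implement the mean subtraction.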
I run this with

episodic_replay = True
minibatch_size = 4
episodic_update_len = None

but still the average_loss is 0, whereas it isn't in the non-recurrent case.
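One way to inspect the reported value (a sketch, assuming the agent constructed in the script below; DQN-family agents in ChainerRL expose get_statistics()):

# After some training updates:
print(agent.get_statistics())
# e.g. [('average_q', 0.12), ('average_loss', 0.0)]  <- stays 0.0 here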
However, there is a good chance that this is due to some bug in my code.
Below I have attached the full source code (without the environment), based on the train_dqn_gym.py you provided.
Thanks again!
def main():
    import logging
    logging.basicConfig(level=logging.INFO)

    parser = argparse.ArgumentParser()
    parser.add_argument('--outdir', type=str, default='dqn_out')
    parser.add_argument('--env', type=str, default='Pendulum-v0')
    parser.add_argument('--seed', type=int, default=None)
    parser.add_argument('--gpu', type=int, default=0)
    parser.add_argument('--final-exploration-steps',
                        type=int, default=1000 * 50)
    parser.add_argument('--start-epsilon', type=float, default=1.0)
    parser.add_argument('--end-epsilon', type=float, default=.05)
    parser.add_argument('--demo', action='store_true', default=False)
    parser.add_argument('--load', type=str, default=None)
    parser.add_argument('--steps', type=int, default=500000)
    parser.add_argument('--prioritized-replay', action='store_true')
    # Note: argparse's type=bool treats any non-empty string as True,
    # so this flag is effectively always on unless the default is changed.
    parser.add_argument('--episodic-replay', type=bool, default=True)
    parser.add_argument('--replay-start-size', type=int, default=None)
    parser.add_argument('--target-update-frequency', type=int, default=1)
    parser.add_argument('--target-update-method', type=str, default='soft')
    parser.add_argument('--soft-update-tau', type=float, default=0.001)
    parser.add_argument('--update-frequency', type=int, default=1)
    parser.add_argument('--eval-n-runs', type=int, default=10)
    parser.add_argument('--eval-frequency', type=int, default=50 * 10)
    parser.add_argument('--n-hidden-channels', type=int, default=100)
    parser.add_argument('--n-hidden-layers', type=int, default=2)
    parser.add_argument('--gamma', type=float, default=0.99)
    parser.add_argument('--minibatch-size', type=int, default=None)
    parser.add_argument('--render-train', action='store_true')
    parser.add_argument('--render-eval', action='store_true')
    parser.add_argument('--monitor', action='store_true')
    parser.add_argument('--reward-scale-factor', type=float, default=.1)
    args = parser.parse_args()

    args.outdir = experiments.prepare_output_dir(
        args, args.outdir, argv=sys.argv)
    print('Output files are saved in {}'.format(args.outdir))

    if args.seed is not None:
        misc.set_random_seed(args.seed)

    def clip_action_filter(a):
        return np.clip(a, action_space.low, action_space.high)

    def make_env(for_eval):
        env = gym.make(args.env)
        if args.monitor:
            env = gym.wrappers.Monitor(env, args.outdir)
        if isinstance(env.action_space, spaces.Box):
            misc.env_modifiers.make_action_filtered(env, clip_action_filter)
        if not for_eval:
            misc.env_modifiers.make_reward_filtered(
                env, lambda x: x * args.reward_scale_factor)
        if ((args.render_eval and for_eval) or
                (args.render_train and not for_eval)):
            misc.env_modifiers.make_rendered(env)
        return env

    env = make_env(for_eval=False)
    timestep_limit = env.spec.tags.get(
        'wrapper_config.TimeLimit.max_episode_steps')
    obs_size = env.observation_space.low.size
    action_space = env.action_space
    n_actions = action_space.n

    class QFunction(chainer.Chain, StateQFunction, RecurrentChainMixin):

        def __init__(self, n_input_channels=3, n_actions=4, bias=0.1):
            self.n_actions = n_actions
            self.n_input_channels = n_input_channels
            conv_layers = chainer.ChainList(
                L.Convolution2D(n_input_channels, 32, 8, stride=4, bias=bias),
                L.Convolution2D(32, 64, 4, stride=2, bias=bias),
                L.Convolution2D(64, 64, 3, stride=1, bias=bias),
                L.Convolution2D(64, 128, 7, stride=1, bias=bias))
            lstm_layer = L.LSTM(128, 128)
            a_stream = MLP(128, n_actions, [2])
            v_stream = MLP(128, 1, [2])
            super().__init__(conv_layers=conv_layers, lstm_layer=lstm_layer,
                             a_stream=a_stream, v_stream=v_stream)

        def __call__(self, x, test=False):
            """
            Args:
                x (ndarray or chainer.Variable): An observation
                test (bool): a flag indicating whether it is in test mode
            """
            h = x
            for l in self.conv_layers:
                h = F.relu(l(h))
            h = self.lstm_layer(h)
            batch_size = x.shape[0]
            ya = self.a_stream(h, test=test)
            mean = F.reshape(
                F.sum(ya, axis=1) / self.n_actions, (batch_size, 1))
            ya, mean = F.broadcast(ya, mean)
            ya -= mean
            ys = self.v_stream(h, test=test)
            ya, ys = F.broadcast(ya, ys)
            q = ya + ys
            return chainerrl.action_value.DiscreteActionValue(q)

    explorer = explorers.LinearDecayEpsilonGreedy(
        args.start_epsilon, args.end_epsilon, args.final_exploration_steps,
        action_space.sample)

    q_func = QFunction(3, 4)
    opt = optimizers.Adam()
    opt.setup(q_func)

    rbuf_capacity = 100000
    if args.episodic_replay:
        print('episodic replay')
        if args.minibatch_size is None:
            args.minibatch_size = 4
        if args.replay_start_size is None:
            args.replay_start_size = 10
        if args.prioritized_replay:
            betasteps = \
                (args.steps - timestep_limit * args.replay_start_size) \
                // args.update_frequency
            rbuf = replay_buffer.PrioritizedEpisodicReplayBuffer(
                rbuf_capacity, betasteps=betasteps)
        else:
            rbuf = replay_buffer.EpisodicReplayBuffer(rbuf_capacity)
    else:
        if args.minibatch_size is None:
            args.minibatch_size = 32
        if args.replay_start_size is None:
            args.replay_start_size = 1000
        if args.prioritized_replay:
            betasteps = (args.steps - args.replay_start_size) \
                // args.update_frequency
            rbuf = replay_buffer.PrioritizedReplayBuffer(
                rbuf_capacity, betasteps=betasteps)
        else:
            rbuf = replay_buffer.ReplayBuffer(rbuf_capacity)

    def phi(obs):
        # HWC uint8 observation -> CHW float32 scaled to [0, 1]
        return (np.swapaxes(obs, 0, 2).astype(np.float32)) / 255.

    gym.undo_logger_setup()  # Turn off gym's default logger settings
    logging.basicConfig(level=logging.DEBUG, stream=sys.stdout, format='')

    agent = DoubleDQN(q_func, opt, rbuf, gpu=args.gpu, gamma=args.gamma,
                      explorer=explorer,
                      replay_start_size=args.replay_start_size,
                      target_update_interval=args.target_update_frequency,
                      update_interval=args.update_frequency,
                      phi=phi, minibatch_size=args.minibatch_size,
                      target_update_method=args.target_update_method,
                      soft_update_tau=args.soft_update_tau,
                      episodic_update=args.episodic_replay,
                      episodic_update_len=None)

    if args.load:
        agent.load(args.load)

    eval_env = make_env(for_eval=True)

    if args.demo:
        mean, median, stdev = experiments.eval_performance(
            env=eval_env,
            agent=agent,
            n_runs=args.eval_n_runs,
            max_episode_len=50)
        print('n_runs: {} mean: {} median: {} stdev: {}'.format(
            args.eval_n_runs, mean, median, stdev))
    else:
        experiments.train_agent_with_evaluation(
            agent=agent, env=env, steps=args.steps,
            eval_n_runs=args.eval_n_runs, eval_interval=args.eval_frequency,
            outdir=args.outdir, eval_env=eval_env,
            max_episode_len=50)


if __name__ == '__main__':
    main()
@kfeeeeee Can you give me the complete code (including import statements) and the command line arguments?
@muupan Sure thing.
Gym-Environment:
gridworld.py
Train-Script:
train_dqn_gym.py
And command execution:
python train_dqn_gym.py --env 'Gridworld-v0'
For the rest, I use the default args defined in train_dqn_gym.py.
Thanks again for looking into this!
Thank you very much for your effort!