
Comments (7)

muupan commented on August 24, 2024

Hi, you need to make sure your model implements the chainerrl.recurrent.Recurrent interface so that it can be treated as a recurrent model. I guess the easiest way to do that is to inherit from chainerrl.recurrent.RecurrentChainMixin, like

class QFunctionRecurrent(chainer.Chain, StateQFunction, RecurrentChainMixin):

, which will find L.LSTM by searching recursively in chainer.Chain and chainer.ChainList.
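
For reference, here is a minimal self-contained sketch of that pattern, written in the same old-style Chainer idioms as the code later in this thread. The class name, layer sizes, and the FC-plus-LSTM layout are made up for illustration only; the point is just that RecurrentChainMixin in the class hierarchy lets ChainerRL locate the L.LSTM child link and manage its hidden state at episode boundaries.

import chainer
import chainer.functions as F
import chainer.links as L
import chainerrl
from chainerrl.q_function import StateQFunction
from chainerrl.recurrent import RecurrentChainMixin


class RecurrentQFunction(chainer.Chain, StateQFunction, RecurrentChainMixin):
    """Hypothetical recurrent Q-function, only to illustrate the mixin."""

    def __init__(self, obs_size=4, n_actions=2, n_hidden=64):
        super().__init__(
            fc=L.Linear(obs_size, n_hidden),
            lstm=L.LSTM(n_hidden, n_hidden),  # found by RecurrentChainMixin
            out=L.Linear(n_hidden, n_actions),
        )

    def __call__(self, x, test=False):
        h = F.relu(self.fc(x))
        h = self.lstm(h)  # hidden state is kept inside the L.LSTM link
        return chainerrl.action_value.DiscreteActionValue(self.out(h))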

Documentation on the usage of recurrent models is largely missing, so I opened another issue for it: #83. Thanks for reporting the issue!

muupan commented on August 24, 2024

Thanks for your code!

Just for clarification:

If episodic_replay=True, then:

minibatch_size corresponds to the number of episodes used for the experience replay

and

episodic_update_len corresponds to the number of time steps within each of those episodes, right?

Thus, if one batch has e.g. 50 time steps and episodic_update_len=16, will it draw 16 consecutive time steps from this episode for replay? Furthermore, if episodic_update_len=None, will it use all time steps within this episode?

You are correct. minibatch_size is the number of episodes to sample for an update. Each sampled episode's length is at most episodic_update_len.
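
As a rough sketch of what those two parameters map to inside the replay buffer (the sample_episodes call and its max_len argument are my reading of chainerrl.replay_buffer, and the concrete numbers are invented for illustration):

from chainerrl import replay_buffer

# Buffer that stores whole episodes instead of flat transitions.
rbuf = replay_buffer.EpisodicReplayBuffer(capacity=10 ** 5)

# ... during training, transitions are appended with rbuf.append(...) and
# episodes are closed with rbuf.stop_current_episode() ...

# minibatch_size=4 and episodic_update_len=16 roughly correspond to:
episodes = rbuf.sample_episodes(4, max_len=16)
for ep in episodes:
    assert len(ep) <= 16  # with max_len=None, the whole episode is used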

As for average_loss, it turned out to be a bug in ChainerRL. Losses are computed and the model is updated as usual. However, the value of average_loss is not updated at all when episodic_update=True. I'll open an issue for it and fix it soon. Thanks for reporting it!

kfeeeeee commented on August 24, 2024

Hi, thank you very much for your fast response! That indeed solved the issue.

Just for clarification:

If episodic_replay=True, then:

minibatch_size corresponds to the number of episodes used for the experience replay

and

episodic_update_len corresponds to the number of time steps within each of those episodes, right?

Thus, if one batch has e.g. 50 time steps and episodic_update_len=16, will it draw 16 consecutive time steps from this episode for replay? Furthermore, if episodic_update_len=None, will it use all time steps within this episode?

Thanks again!

kfeeeeee commented on August 24, 2024

Unfortunately, my statement above about the issue being solved was a bit hasty.

I modified the Q-function, i.e.:

class QFunction(chainer.Chain, StateQFunction, RecurrentChainMixin):

    def __init__(self, n_input_channels=3, n_actions=4, bias=0.1):
        self.n_actions = n_actions
        self.n_input_channels = n_input_channels
        conv_layers = chainer.ChainList(
            L.Convolution2D(n_input_channels, 32, 8, stride=4, bias=bias),
            L.Convolution2D(32, 64, 4, stride=2, bias=bias),
            L.Convolution2D(64, 64, 3, stride=1, bias=bias),
            L.Convolution2D(64, 128, 7, stride=1, bias=bias),
        )

        lstm_layer = L.LSTM(128, 128)

        a_stream = MLP(128, n_actions, [2])
        v_stream = MLP(128, 1, [2])

        super().__init__(conv_layers=conv_layers, lstm_layer=lstm_layer,
                         a_stream=a_stream, v_stream=v_stream)

    def __call__(self, x, test=False):
        """
        Args:
            x (ndarray or chainer.Variable): An observation
            test (bool): a flag indicating whether it is in test mode
        """
        h = x
        for l in self.conv_layers:
            h = F.relu(l(h))
        h = self.lstm_layer(h)

        batch_size = x.shape[0]
        ya = self.a_stream(h, test=test)
        mean = F.reshape(F.sum(ya, axis=1) / self.n_actions, (batch_size, 1))
        ya, mean = F.broadcast(ya, mean)
        ya -= mean

        ys = self.v_stream(h, test=test)

        ya, ys = F.broadcast(ya, ys)
        q = ya + ys
        return chainerrl.action_value.DiscreteActionValue(q)

with

episodic_replay = True
minibatch_size = 4
episodic_update_len = None

but still, the average_loss is 0, whereas it isn't in the non-recurrent case.

However, there is a good chance that this is due to some bug in my code.

Below I have attached the full source code (without the environment), based on the train_dqn_gym.py you have provided.

Thanks again!

def main():
    import logging
    logging.basicConfig(level=logging.INFO)

    parser = argparse.ArgumentParser()
    parser.add_argument('--outdir', type=str, default='dqn_out')
    parser.add_argument('--env', type=str, default='Pendulum-v0')
    parser.add_argument('--seed', type=int, default=None)
    parser.add_argument('--gpu', type=int, default=0)
    parser.add_argument('--final-exploration-steps',
                        type=int, default=1000 * 50)
    parser.add_argument('--start-epsilon', type=float, default=1.0)
    parser.add_argument('--end-epsilon', type=float, default=.05)
    parser.add_argument('--demo', action='store_true', default=False)
    parser.add_argument('--load', type=str, default=None)
    parser.add_argument('--steps', type=int, default=500000)
    parser.add_argument('--prioritized-replay', action='store_true')
    parser.add_argument('--episodic-replay',type=bool, default=True)
    parser.add_argument('--replay-start-size', type=int, default=None)
    parser.add_argument('--target-update-frequency', type=int, default=1)
    parser.add_argument('--target-update-method', type=str, default='soft')
    parser.add_argument('--soft-update-tau', type=float, default=0.001)
    parser.add_argument('--update-frequency', type=int, default=1)
    parser.add_argument('--eval-n-runs', type=int, default=10)
    parser.add_argument('--eval-frequency', type=int, default=50*10)
    parser.add_argument('--n-hidden-channels', type=int, default=100)
    parser.add_argument('--n-hidden-layers', type=int, default=2)
    parser.add_argument('--gamma', type=float, default=0.99)
    parser.add_argument('--minibatch-size', type=int, default=None)
    parser.add_argument('--render-train', action='store_true')
    parser.add_argument('--render-eval', action='store_true')
    parser.add_argument('--monitor', action='store_true')
    parser.add_argument('--reward-scale-factor', type=float, default=.1)
    args = parser.parse_args()

    

    args.outdir = experiments.prepare_output_dir(
        args, args.outdir, argv=sys.argv)
    print('Output files are saved in {}'.format(args.outdir))

    if args.seed is not None:
        misc.set_random_seed(args.seed)

    def clip_action_filter(a):
        return np.clip(a, action_space.low, action_space.high)

    def make_env(for_eval):
        env = gym.make(args.env)
        if args.monitor:
            env = gym.wrappers.Monitor(env, args.outdir)
        if isinstance(env.action_space, spaces.Box):
            misc.env_modifiers.make_action_filtered(env, clip_action_filter)
        if not for_eval:
            misc.env_modifiers.make_reward_filtered(
                env, lambda x: x * args.reward_scale_factor)
        if ((args.render_eval and for_eval) or
                (args.render_train and not for_eval)):
            misc.env_modifiers.make_rendered(env)
        return env

    env = make_env(for_eval=False)
    timestep_limit = env.spec.tags.get(
        'wrapper_config.TimeLimit.max_episode_steps')
    obs_size = env.observation_space.low.size
    action_space = env.action_space

    n_actions = action_space.n

    class QFunction(chainer.Chain, StateQFunction,RecurrentChainMixin):

        def __init__(self, n_input_channels=3, n_actions = 4, bias=0.1):
            self.n_actions = n_actions
            self.n_input_channels = n_input_channels
            conv_layers = chainer.ChainList(
                L.Convolution2D(n_input_channels, 32, 8, stride=4, bias=bias),
                L.Convolution2D(32, 64, 4, stride=2, bias=bias),
                L.Convolution2D(64, 64, 3, stride=1, bias=bias),
                L.Convolution2D(64, 128, 7, stride=1, bias=bias)
                )

            lstm_layer = L.LSTM(128, 128)                     

            a_stream = MLP(128,n_actions,[2])
            v_stream = MLP(128,1,[2])

            super().__init__(conv_layers=conv_layers, lstm_layer=lstm_layer, a_stream=a_stream,v_stream=v_stream)

        def __call__(self, x, test=False):
            """
            Args:
                x (ndarray or chainer.Variable): An observation
                test (bool): a flag indicating whether it is in test mode
            """
            h = x
            for l in self.conv_layers:
                h = F.relu(l(h))
            h = self.lstm_layer(h)

            batch_size = x.shape[0]
            ya = self.a_stream(h, test=test)
            mean = F.reshape(F.sum(ya,axis=1) / self.n_actions, (batch_size,1))
            ya, mean = F.broadcast(ya,mean)
            ya -= mean

            ys = self.v_stream(h,test=test)
            
            ya,ys = F.broadcast(ya,ys)
            q = ya+ys
            return chainerrl.action_value.DiscreteActionValue(q)

    
    explorer = explorers.LinearDecayEpsilonGreedy(
        args.start_epsilon, args.end_epsilon, args.final_exploration_steps,
        action_space.sample)
    q_func = QFunction(3, 4)

    opt = optimizers.Adam()
    opt.setup(q_func)

    rbuf_capacity = 100000
    if args.episodic_replay:
        print('episodic replay')
        if args.minibatch_size is None:
            args.minibatch_size = 4
        if args.replay_start_size is None:
            args.replay_start_size = 10
        if args.prioritized_replay:
            betasteps = \
                (args.steps - timestep_limit * args.replay_start_size) \
                // args.update_frequency
            rbuf = replay_buffer.PrioritizedEpisodicReplayBuffer(
                rbuf_capacity, betasteps=betasteps)
        else:
            rbuf = replay_buffer.EpisodicReplayBuffer(rbuf_capacity)
    else:
        if args.minibatch_size is None:
            args.minibatch_size = 32
        if args.replay_start_size is None:
            args.replay_start_size = 1000
        if args.prioritized_replay:
            betasteps = (args.steps - args.replay_start_size) \
                // args.update_frequency
            rbuf = replay_buffer.PrioritizedReplayBuffer(
                rbuf_capacity, betasteps=betasteps)
        else:
            rbuf = replay_buffer.ReplayBuffer(rbuf_capacity)

    def phi(obs):
        return (np.swapaxes(obs,0,2).astype(np.float32))/255.

    
    

    gym.undo_logger_setup()  # Turn off gym's default logger settings
    logging.basicConfig(level=logging.DEBUG, stream=sys.stdout, format='')

    agent = DoubleDQN(q_func, opt, rbuf, gpu=args.gpu, gamma=args.gamma,
                explorer=explorer, replay_start_size=args.replay_start_size,
                target_update_interval=args.target_update_frequency,
                update_interval=args.update_frequency,
                phi=phi, minibatch_size=args.minibatch_size,
                target_update_method=args.target_update_method,
                soft_update_tau=args.soft_update_tau,
                episodic_update=args.episodic_replay, episodic_update_len=None)

    if args.load:
        agent.load(args.load)

    eval_env = make_env(for_eval=True)

    if args.demo:
        mean, median, stdev = experiments.eval_performance(
            env=eval_env,
            agent=agent,
            n_runs=args.eval_n_runs,
            max_episode_len=50)
        print('n_runs: {} mean: {} median: {} stdev: {}'.format(
            args.eval_n_runs, mean, median, stdev))
    else:
        experiments.train_agent_with_evaluation(
            agent=agent, env=env, steps=args.steps,
            eval_n_runs=args.eval_n_runs, eval_interval=args.eval_frequency,
            outdir=args.outdir, eval_env=eval_env,
            max_episode_len=50)


if __name__ == '__main__':
    main()

muupan commented on August 24, 2024

@kfeeeeee Can you give me the complete code (including import statements) and the command line arguments?

kfeeeeee commented on August 24, 2024

@muupan Sure thing.

Gym-Environment:
gridworld.py

Train-Script:
train_dqn_gym.py

And command execution:
python train_dqn_gym.py --env 'Gridworld-v0'

For the rest, I use the default args defined in train_dqn_gym.py.

Thanks again for looking into this!

kfeeeeee commented on August 24, 2024

Thank you very much for your effort!
